Preface



The fusion of different information sources is a persistent and intriguing issue. It 
has been addressed for centuries in various disciplines, including political science, 
probability and statistics, system reliability assessment, computer science, and 
distributed detection in communications. Early seminal work on fusion was car- 
ried out by pioneers such as Laplace and von Neumann. More recently, research 
activities in information fusion have focused on pattern recognition. During the 
1990s, classifier fusion schemes, especially at the so-called decision-level, emerged 
under a plethora of different names in various scientific communities, including 
machine learning, neural networks, pattern recognition, and statistics. The dif- 
ferent nomenclatures introduced by these communities reflected their different 
perspectives and cultural backgrounds as well as the absence of common forums 
and the poor dissemination of the most important results. 

In 1999, the first workshop on multiple classifier systems was organized with 
the main goal of creating a common international forum to promote the dissem- 
ination of the results achieved in the diverse communities and the adoption of 
a common terminology, thus giving the different perspectives and cultural back- 
grounds some concrete added value. After five meetings of this workshop, there 
is strong evidence that significant steps have been made towards this goal. Re- 
searchers from these diverse communities successfully participated in the work- 
shops, and world experts presented surveys of the state of the art from the 
perspectives of their communities to aid cross-fertilization. The term multiple
classifier systems currently appears in the list of topics of several international
conferences, these workshop proceedings are often cited in journal and conference
papers, and tutorials on multiple classifier systems have been given during
relevant international conferences such as Information Fusion 2002. Last, but not 
least, in the pattern recognition community, the term multiple classifier systems 
has been adopted as the main reference to this subject area. 

Following its four predecessors published by Springer-Verlag, this volume
contains the proceedings of the 5th International Workshop on Multiple Classi- 
fier Systems (MCS 2004), held at Tanka Village, Cagliari, Italy, on June 9-11, 
2004. Thirty-five papers out of the 50 submitted from researchers in diverse 
research communities were selected by the scientific committee, and they were 
organized into sessions dealing with bagging and boosting, combination and de- 
sign methodologies, analysis and performance evaluation, and applications. The 
workshop program and this volume were enriched by two invited talks given by 
Ludmila I. Kuncheva (University of Wales, Bangor, UK), and Nageswara S.V. 
Rao (Oak Ridge National Laboratory, USA). 

This workshop was supported by the Department of Electrical and Electronic 
Engineering of the University of Cagliari, Italy, the University of Surrey, Guild- 
ford, UK, and Gruppo Vitrociset. Their support is gratefully acknowledged. We 
also thank the International Association for Pattern Recognition and its Technical
Committee TC1 on Statistical Pattern Recognition Techniques for sponsoring
MCS 2004. 

We wish to express our appreciation to all those who helped to organize 
MCS 2004. First of all, we would like to thank all the members of the Scientific 
Committee whose professionalism was instrumental in creating a very interesting 
scientific program. Special thanks are due to the members of the Organizing 
Committee, Giorgio Giacinto, Giorgio Fumera, and Gian Luca Marcialis, for 
their indispensable contributions to the MCS 2004 Web site management, local 
organization, and proceedings preparation. 



June 2004 Fabio Roli, Josef Kittler, Terry Windeatt 
Classifier Ensembles for Changing Environments



Ludmila I. Kuncheva 

School of Informatics, University of Wales, Bangor 
Bangor, Gwynedd, LL57 1UT, United Kingdom
l.i.kuncheva@bangor.ac.uk



Abstract. We consider strategies for building classifier ensembles for 
non-stationary environments where the classification task changes during 
the operation of the ensemble. Individual classifier models capable of 
online learning are reviewed. The concept of “forgetting” is discussed. 
Online ensembles and strategies suitable for changing environments are 
summarized. 

Keywords: classifier ensembles, online ensembles, incremental learning, 
non-stationary environments, concept drift. 



1 Introduction 

“All things flow, everything runs, as the waters of a river, which seem 
to be the same but in reality are never the same, as they are in a state 
of continuous flow.”



The doctrine of Heraclitus. 

Most of the current research in multiple classifier systems is devoted to static 
environments. We assume that the classification problem is fixed and we are 
presented with a data set, large or small, on which to design a classifier. The 
solutions to the static task have matured over the years to such perfection
that the ranking of the classification methods is decided by a fraction
of a percent of the classification accuracy. Everything that exists changes with time
and so will the classification problem. The changes could be minor fluctuations 
of the underlying probability distributions, steady trends, random or systematic, 
rapid substitution of one classification task with another and so on. 

A classifier (individual or an ensemble)¹, if intended for a real application,
should be equipped with a mechanism to adapt to the changes in the environment.
Various solutions to this problem have been proposed over the years.
Here we try to give a systematic perspective on the problem and the current
solutions, and outline new research avenues.

The paper is organized as follows. Section 2 sets the scene by introducing
the concept of a changing environment. Online classifier models are presented in
Section 3. Section 4 details some ensemble strategies for changing environments.

¹ Unless specified otherwise, a classifier is any mapping $D : \mathbb{R}^n \to \Omega$ from the feature
space, $\mathbb{R}^n$, to the set of class labels $\Omega = \{\omega_1, \ldots, \omega_c\}$.



F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 1–15, 2004.
© Springer-Verlag Berlin Heidelberg 2004
2 The Changing Environment

Changing environments pose the main hurdle in many applications. One of the 
most acute examples is detecting and filtering out spam e-mail². The descriptions
of the two classes of e-mail, “spam” and “non-spam”, evolve with time;
they are user-specific, and user preferences may also evolve with time. Besides, 
the important discriminating variables used at time t to classify spam may be 
irrelevant at a future moment t + k. To make matters even worse, the variability 
of the environment in this scenario is hardly due to chance. The classifier will 
have to face an active opponent - the “spammers” themselves - who will keep 
coming up with ingenious solutions to trick the classifier into labeling a spam 
e-mail as legitimate. 

2.1 Concept Drift and Types of Changes 

Viewed in a probabilistic sense, a classification problem may change due to the 
changes in [13] 

— Prior probabilities for the c classes, $P(\omega_1), \ldots, P(\omega_c)$;

— Class-conditional probability distributions, $p(\mathbf{x}|\omega_i)$, $i = 1, \ldots, c$; or

— Posterior probabilities $P(\omega_i|\mathbf{x})$, $i = 1, \ldots, c$.

Not every change in the distribution is going to degrade the performance of a
classifier. Consider the minimum-error classifier which labels $\mathbf{x}$ as the class index
of the largest posterior probability $P(\omega_i|\mathbf{x})$. If the largest posterior probability
for every $\mathbf{x} \in \mathbb{R}^n$ keeps its class index, then the decision of the classifier for
this $\mathbf{x}$ will guarantee the minimum error no matter what the changes in the
distributions are. Kelly et al. [13] call the changes in the probabilities population
drift and remark that the notion of concept drift used in the machine learning
literature is the more general one.

While in a natural system we can expect gradual drifts (e.g., seasonal, demo- 
graphic, habitual, etc.), sometimes the class description may change rapidly due 
to hidden contexts. Such contexts may be, for example, illumination in image 
recognition and accents in speech recognition [24]. The context might instantly 
become highly relevant. Consider a system trained on images with similar illumi- 
nation. If images with the same type of content but a different illumination are 
fed to the system, the class descriptions might change so as to make the system 
worse than a random guess. The types of changes can be roughly summarized as
follows:

— Random noise [1,23] 

— Random trends (gradual changes) [13] 

— Random substitutions (abrupt changes) [23] 

— Systematic trends (“recurring contexts”) [23] 

² The term SPAM was coined to denote unsolicited e-mail, usually of a commercial or
offensive nature.
Depending on the type of changes, different strategies for building the classifier
may be appropriate. The noise must not be modelled by the classifier but
filtered out. On the other hand, if there are systematic changes whereby similar
class descriptions are likely to reappear, we may want to keep past successful
classifiers and simply reuse them when appropriate. If the changes are gradual,
we may use a moving window on the training data. If the changes are abrupt,
we may choose to use a static classifier and, when a change is detected, pause
and retrain the classifier. This scenario is important when verification of the
class labels of the streaming data is not easily available. In general, the more that is
known about the type of the context drift, the better the chances of devising a
successful updating strategy.



2.2 Detecting a Change 



Unlabeled Data. Sometimes online labeling is not straightforward. For exam- 
ple, in scanning mammograms for lesions, a verified diagnosis from a specialist 
is needed if we want to reuse the processed images for further learning. An- 
other example is the credit application problem [13] where the true class label 
(good/bad) becomes known two years after the classification has taken place. In 
spam e-mail filtering, the user must confirm the legitimacy of every message if 
online training is intended. 

When the labels of the streaming data are unknown, the classifier should be
able to signal a potential concept drift based on the unlabeled data. This is
typically done by monitoring the unconditional probability distribution, p(x).
This problem is called novelty detection [18]. It is related to outlier detection in
statistics.

One possible practical solution is to train an additional “classifier” to model
p(x) and compare the value for each input x with a threshold θ. If x is accepted
as having come from the distribution of the problem (p(x) > θ), the system
proceeds to classify it. Otherwise we refuse to classify x and increment the count of
novel objects. We can keep the count over a window of past examples. When the
proportion of novel examples reaches a certain level, the system should either
halt or request a fresh labeled sample to retrain itself (if this option is
incorporated). This addition to the original classifier requires three parameters to
be specified by the user: θ, the threshold for the novelty; the threshold for the
proportion of novel examples; and finally the size of the window. In the simplest
case, the distance to the nearest neighbor from the current training data set
can be used and θ could be a threshold distance. In theory this technique is
equivalent to nonparametric modelling of p(x). More sophisticated modelling
approaches include Gaussian Mixture Modelling, Hidden Markov Models, kernel
approximation, etc. (see the survey [18]).
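
The monitoring scheme described above can be sketched as follows. This is a minimal illustration, assuming the simplest variant in which the distance to the nearest neighbor in the training set serves as the novelty score, so an object is treated as novel when that distance exceeds the threshold (the reverse direction of the test p(x) > θ). The class name `NoveltyMonitor`, its parameter names, and the choice of Euclidean distance are hypothetical and not taken from the paper.

```python
import math
from collections import deque


class NoveltyMonitor:
    """Signals a potential concept drift from unlabeled data (illustrative sketch)."""

    def __init__(self, reference, dist_threshold, max_novel_fraction, window_size):
        self.reference = list(reference)              # current training objects
        self.dist_threshold = dist_threshold          # threshold for novelty (theta)
        self.max_novel_fraction = max_novel_fraction  # tolerated proportion of novel objects
        self.recent = deque(maxlen=window_size)       # 1 = novel, 0 = accepted

    def is_novel(self, x):
        # Distance from x to its nearest neighbor in the reference set.
        nearest = min(math.dist(x, r) for r in self.reference)
        return nearest > self.dist_threshold

    def observe(self, x):
        """Return True once the window is full and too many objects in it are novel."""
        self.recent.append(1 if self.is_novel(x) else 0)
        window_full = len(self.recent) == self.recent.maxlen
        fraction = sum(self.recent) / len(self.recent)
        return window_full and fraction >= self.max_novel_fraction
```

When `observe` returns True, the surrounding system would halt or request a fresh labeled sample, as described in the text. Waiting for a full window before signalling is a design choice of this sketch.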

Note that using only the unconditional distribution, p(x), we may miss important
changes in the class distributions which may alter the classification task
but not p(x).
Labeled Data. The most direct indication that there has been an adverse
change in the classification task is a consistent fall in the classification accuracy, 
either sudden or gradual. 

2.3 Learn to Forget 

Suppose that the classifier has the ability to learn on-line. Instead of trying to 
detect changes, the classifier is kept constantly up-to-date, which is called “any
time learning”. This means that if the system is stopped at time t, we have
the best classifier for the classification problem as it is at time t. The classifier
should be able to learn new class descriptions and “forget” outdated knowledge,
or “unlearn”. The main problem is how to choose the rate of forgetting so that
it matches the rate and the type of the changes [15,23]. If the character of the 
changes is known or at least suspected, e.g., gradual seasonal changes, then an 
optimal forgetting strategy can be designed. 

Forgetting by Ageing at a Constant Rate. The most common solution is 
forgetting training objects at a constant rate and using the most recent training 
set to update (re-train) the classifier. The current classifier uses the past w
objects. When a new object arrives, the classifier is updated as if it had been trained on
a data set in which the oldest observation is replaced by the newest one.

How large a window do we need? If the window is small, the system will be 
very responsive and will react quickly to changes but the accuracy of the classifier 
might be low due to insufficient training data in the window. Alternatively, a 
large window may lead to a sluggish but stable and well trained classifier. The 
compromise between the two is viewed as the “stability-plasticity dilemma” [15] 
whose solution lies with adjusting the “forgetting” parameters for the concrete 
problem. 
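
A fixed-size window can be sketched with a queue: each new labeled object is learned, and the oldest one is unlearned once the window holds more than w objects. The example below tracks only class frequencies, which is enough to show the mechanism. The class name `WindowedClassCounts` and its methods are hypothetical.

```python
from collections import Counter, deque


class WindowedClassCounts:
    """Class frequencies over the w most recent labeled objects (illustrative sketch)."""

    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.counts = Counter()

    def update(self, label):
        # Learn the newest object.
        self.window.append(label)
        self.counts[label] += 1
        # Forget the oldest object once the window is over capacity.
        if len(self.window) > self.w:
            oldest = self.window.popleft()
            self.counts[oldest] -= 1

    def prior(self, label):
        # Estimated prior probability of `label` within the current window.
        return self.counts[label] / len(self.window)
```

A small w makes the estimates follow a change within a few objects, at the price of noisier estimates. This is the trade-off discussed above.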

Forgetting by Ageing at a Variable Rate. If a change is detected, the
window is shrunk (past examples are forgotten)³. During a static period, the window
is expanded up to a predefined limit.
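
One way to picture a variable-rate window is a rule that adjusts the window size after every object. The sketch below is only an illustration. Halving the window on a detected change, growing it by one object otherwise, and the default limits are all assumptions made here; they are not the heuristics of FLORA 3 or of any other published method.

```python
def adapt_window_size(size, change_detected, min_size=10, max_size=100):
    """Shrink the window after a detected change, otherwise let it grow slowly."""
    if change_detected:
        # Forget the older half of the window, but keep a minimum of data.
        return max(min_size, size // 2)
    # Static period: expand by one object up to the predefined limit.
    return min(max_size, size + 1)
```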

To illustrate the concepts being introduced we will follow a synthetic
example taken from [23]. There are three categorical features with three categories
each: size ∈ {small, medium, large}, colour ∈ {red, green, blue} and shape ∈
{square, circular, triangular}. There are three classification tasks to be learned.
The first class to be distinguished is (size = small AND colour = red). Forty examples
are generated randomly (with a uniform distribution over the possible
values) and labeled according to the class description. These are fed to the classifier
sequentially. An independent set of 100 objects labeled according to the
current class description is generated for each of the 40 objects. The classifier is
tested after each submission on the respective 100 examples. The class description
is changed at step 40, so that objects 41 to 80 are labeled according to class
(colour = green OR shape = circular). The testing objects are generated again
from a uniform random distribution and labeled according to the current class
description. Finally, a last set of 40 objects is generated (from 81 to 120) and

³ FLORA 3 is an example of this group of methods [23].
labeled according to class (size = small OR size = large). The largest of the two
prior probabilities for the first stage is 0.89, for the second stage is 0.56, and for 
the third stage, 0.67. 
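
The three-stage stream and the quoted priors can be reproduced with a short script. This is a sketch of the data-generating process as described in the text. The function and constant names are hypothetical, and the priors are obtained by enumerating all 27 feature combinations rather than by sampling.

```python
import itertools
import random

SIZES = ("small", "medium", "large")
COLOURS = ("red", "green", "blue")
SHAPES = ("square", "circular", "triangular")

# One class description (concept) per stage of 40 objects.
CONCEPTS = (
    lambda size, colour, shape: size == "small" and colour == "red",
    lambda size, colour, shape: colour == "green" or shape == "circular",
    lambda size, colour, shape: size in ("small", "large"),
)


def largest_prior(concept):
    """Largest of the two prior probabilities under uniform sampling."""
    objects = list(itertools.product(SIZES, COLOURS, SHAPES))
    p_positive = sum(concept(*obj) for obj in objects) / len(objects)
    return max(p_positive, 1.0 - p_positive)


def generate_stream(rng, per_stage=40):
    """Yield (object, label) pairs; the concept changes every `per_stage` objects."""
    for concept in CONCEPTS:
        for _ in range(per_stage):
            obj = (rng.choice(SIZES), rng.choice(COLOURS), rng.choice(SHAPES))
            yield obj, concept(*obj)


print([round(largest_prior(c), 2) for c in CONCEPTS])  # [0.89, 0.56, 0.67]
```

These priors are the accuracies that a classifier always predicting the majority class would reach in each stage, which gives a baseline for reading the accuracy curves.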

The importance of using a valid forgetting strategy is demonstrated in Fig- 
ure 1(a). The Naive Bayes classifier is trained incrementally by updating the 
probabilities for the classes with each new data point (each example). The plot 
shows the accuracies of three classifiers versus the number of examples processed. 
The graph is the average of 10 runs. The classifier based on windows of variable 
size appears to be more versatile and able to recover from an abrupt change of 
class concept than the classifier with a constant window and the “no forgetting” 
classifier. 




[Figure 1: (a) Naive Bayes classifier without and with forgetting (fixed and variable
window); (b) Hedge β, Winnow and Weighted Majority for an ensemble of 3 ideal classifiers.]

Fig. 1. Classification accuracy for online algorithms.



Density-Based Forgetting. Sometimes older data points may be more useful 
than more recent points. Deleting training data based on the distribution is 
considered in adaptive nearest neighbors models [1,20]. A weight is attached to
each data point. The weight may decay with time or may be modified depending 
on how often the object has been found as a nearest neighbor. 



3 Online Learning 

Online learning is increasingly taking center stage in the current
information era. Massive streams of data are being processed every day,
exceeding by far the time and memory capacity of ever-improving modern
computational technology. Examples of large data streams can be found in
telecommunications, credit card transactions, Internet searches, etc. [6,7,21,22].

Online learning, also termed incremental learning, is primarily focused on 
processing the data in a sequential way so that in the end the classifier is no worse 
than a (hypothetical) classifier trained on the batch data. A static environment 
is typically assumed, so forgetting of the learned knowledge is not envisaged.
Each submitted data point is classified and at some later stage its true label 
is recovered. The data point and its label are added to the training set of the 
classifier. The classifier is updated, minimally if possible, to accommodate the 
new training point. In the simplest case the true label is known immediately. 
Adapting to the changing environment may come as an effortless bonus in online 
learning as long as some forgetting mechanism is put into place. 

3.1 Desiderata for Online Classifiers 

A good online classifier should have the following qualities [7,21]:

— Single pass through the data. The classifier must be able to learn from each 
example without revisiting it. 

— Limited memory and processing time. Each example should be processed in
a (small) constant time regardless of the number of examples processed in
the past.

— Any-time-learning. If stopped at time t the algorithm should provide the 
best answer. The trained classifier should ideally be equivalent to a classifier 
trained on the batch data up to time t. An algorithm which produces a 
classifier functionally equivalent to the corresponding classifier trained on 
the batch data is termed a lossless online algorithm [19]. 

In fact, this list also applies to classifiers able to learn in changing environments.
Any-time-learning accounts for the need for adapting the training in case of a 
concept drift. 

3.2 Online Classifier Models and Their Extensions 
for Changing Environments 

One of the oldest online training algorithms is Rosenblatt’s perceptron, even
though it possesses only the second of the desired qualities.

Learning Vector Quantization (LVQ). LVQ is a simple and successful online
learning algorithm that also originated in the neural network literature [14].
The principle is as follows. We initialize (randomly or otherwise) a set of points
labeled in the classes of the problem. The points are called prototypes. At the
presentation of a new object, $\mathbf{x} \in \mathbb{R}^n$, the prototypes’ locations in the feature
space are updated. The prototypes of the same class as x are pulled toward x
while the prototypes from different classes are pushed away. In the simplest version
only the prototype nearest to x is updated. The magnitude of the movement
is controlled by a parameter of the algorithm, $\alpha$, called the learning rate. Let
$\mathbf{v} \in \mathbb{R}^n$ be a prototype. The new value of the prototype is

$$\mathbf{v}^{new} = \begin{cases} \mathbf{v} + \alpha(\mathbf{x} - \mathbf{v}), & \text{if } \mathbf{v} \text{ has the same class label as } \mathbf{x} \\ \mathbf{v} - \alpha(\mathbf{x} - \mathbf{v}), & \text{otherwise} \end{cases}$$

For training a classifier in a stationary environment the learning rate $\alpha$ is typically
chosen as a decreasing function of the number of iterations so that at the end of
the algorithm only small changes take place. We can regard $\alpha$ as a function of
the accuracy of the classifier and enforce more aggressive training when a change
in the environment is detected.
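
The simplest LVQ variant, in which only the prototype nearest to x moves, can be sketched as below. The function name `lvq_update`, the list-of-pairs representation of the prototypes, and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def lvq_update(prototypes, x, label, alpha):
    """Move the prototype nearest to x; `prototypes` is a list of (vector, class_label)."""
    # Index of the nearest prototype (Euclidean distance).
    k = min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i][0], x))
    v, v_label = prototypes[k]
    # Pull toward x if the labels agree, push away otherwise.
    sign = 1.0 if v_label == label else -1.0
    v_new = [vj + sign * alpha * (xj - vj) for vj, xj in zip(v, x)]
    prototypes[k] = (v_new, v_label)
    return k
```

To follow a changing environment, a caller could pass a larger `alpha` whenever a drop in accuracy is observed, in line with the remark above. How to map accuracy to `alpha` is left open here.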

Decision Trees. To understand how the updating works, recall how a decision 
tree is built. Starting with a root node, at each node we decide whether or not the 
tree should be split further. If yes, a feature and its threshold value are selected 
for that node so as to give the best split according to a specified criterion. A
new object x is propagated down the tree to a leaf, forming a
path $P_{\mathbf{x}}$. The counts of training objects reaching the nodes on the path are
updated. For every internal node, the feature chosen for the split is confirmed 
along with the threshold value for the split. All necessary alterations are made 
so that the current tree is equivalent to a tree built upon the whole data set of 
all past x’s. 

Updating a tree may be an intricate job and may take longer than is
acceptable for an online system processing millions of records a day. Domingos
and Hulten [6] propose a system for this case called VFDT (Very Fast Decision
Trees). They use the Hoeffding bound to guarantee that the VFDT will be
asymptotically equivalent to the batch tree. The Hoeffding bound makes it possible to
calculate the sample size needed to estimate a confidence interval for the mean of the
variable of interest, regardless of the distribution of the variable⁴. The importance
of the bound is that after a certain number of input points has been processed,
there is no need to update the tree further. The VFDT system is guaranteed to
be asymptotically optimal for static distributions [6,7]. Hulten et al. [11] extend
VFDT to Concept-adapting VFDT to cope with changing environments.
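
To make the role of the bound concrete, the sketch below computes the half-width of the confidence interval and the sample size needed for a given precision. It assumes the two-sided form of the bound, $\epsilon = R\sqrt{\ln(2/\delta)/(2n)}$, for a variable with range R. The function names are hypothetical, and this is not code from the VFDT system.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Half-width of the confidence interval after n i.i.d. observations."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def hoeffding_sample_size(value_range, delta, epsilon):
    """Smallest n for which the half-width does not exceed epsilon."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

For example, with a range of 1, δ = 0.05 and ε = 0.1, `hoeffding_sample_size` returns 185. The required sample size does not depend on the distribution of the variable, which is why it can be used to decide when enough points have been seen at a node.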

Naive Bayes. In the Naive Bayes model, the class-conditional probabilities are
updated with each new x. For simplicity, suppose that x consists of n categorical
variables $x_1, x_2, \ldots, x_n$. For each variable, we keep a probability mass function
over the possible categories for each of the classes, $P(x_k|\omega_i)$. When a new x is
received, labeled in class $\omega_j$, the probabilities for $\omega_j$ are updated. For example,
let x = (small, blue, circular) and let the label of x be $\omega_1$. Suppose that 99 objects
from class $\omega_1$ have been processed hitherto. Let $P(x_1 = S|\omega_1) = 0.2$, $P(x_1 = M|\omega_1) = 0.7$,
and $P(x_1 = L|\omega_1) = 0.1$. The updated values for $x_1$ are

$$P(x_1 = S|\omega_1) = \frac{0.2 \times 99 + 1}{100} = 0.208, \quad P(x_1 = M|\omega_1) = \frac{0.7 \times 99}{100} = 0.693, \quad P(x_1 = L|\omega_1) = \frac{0.1 \times 99}{100} = 0.099.$$



⁴ The Hoeffding bound is also known as the additive Chernoff bound [6]. It states that for
any $\delta \in (0, 1)$, with probability at least $1 - \delta$ and irrespective of the true distribution
of the random variable y with range R, the true mean of y is within $\epsilon$ of the mean
$\bar{y}$ calculated from an i.i.d. sample of size n taken from the distribution of y, where

$$\epsilon = R \sqrt{\frac{\ln(2/\delta)}{2n}}$$
The Naive Bayes classifier is a lossless classifier. A forgetting mechanism can be
incorporated by keeping a window and “unlearning” the oldest example in the 
window while learning x. 
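
The incremental update and its inverse can be sketched for a single categorical variable and a single class. The functions `learn` and `unlearn` are hypothetical names. They operate directly on the probability mass function and the number of objects of that class seen so far, and reproduce the worked example with 99 previously processed objects.

```python
def learn(pmf, n_seen, observed):
    """Update P(x_k | class) after one more object of that class shows `observed`."""
    return {cat: (p * n_seen + (cat == observed)) / (n_seen + 1)
            for cat, p in pmf.items()}


def unlearn(pmf, n_seen, forgotten):
    """Inverse of `learn`: remove one object whose category was `forgotten`."""
    return {cat: (p * n_seen - (cat == forgotten)) / (n_seen - 1)
            for cat, p in pmf.items()}


pmf = {"S": 0.2, "M": 0.7, "L": 0.1}
updated = learn(pmf, 99, "S")            # S: 0.208, M: 0.693, L: 0.099
restored = unlearn(updated, 100, "S")    # back to 0.2, 0.7, 0.1 up to rounding
```

With a window, each arriving x would trigger one `learn` call for its own class and one `unlearn` call for the class of the oldest object in the window. In practice it is usually simpler to store counts rather than probabilities.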

Neural Networks. The training of neural networks is carried out by submitting 
the training set in a sequential manner a specified number of times (number 
of epochs). To train a NN online we abandon the epoch protocol and use the 
continuous stream of data instead. The NN classifier is not a lossless model [19].
If we keep the training on-going with the data stream, the NN parameters will
follow the concept drift. The responsiveness of the NN to changes will depend on
the learning rate used in the backpropagation algorithm. A neural architecture 
called PFAM (Probabilistic Fuzzy ARTMAP neural network) is used as the base 
classifier for an ensemble suitable for online learning in [15]. 

Nearest Neighbor. The nearest neighbour classifier is both intuitive and accurate.
We can build the training set (the reference set from which the neighbors are
found) by storing each labeled x as it comes. This model is called IB1 (instance-based
learning, model 1) in [1]. IB1 is a lossless algorithm but it fails on the
criteria for time and memory. IB2 accepts in the reference set only the x’s which
are misclassified using the reference set at the time they arrive. This is the
online version of Hart’s editing algorithm [10]. Editing algorithms look for the
minimal possible reference set with as high as possible generalization accuracy.
IB3 introduces forgetting based on the usefulness of the x’s in the reference set.
There have been many studies on editing for the nearest neighbour algorithm 
[2,5]. Developing efficient online editing algorithms could be one of the important 
future directions. 
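
The IB2 acceptance rule can be sketched in a few lines: a labeled object enters the reference set only if the current reference set misclassifies it. The function name `ib2_update`, the 1-nearest-neighbor rule with Euclidean distance, and the convention that an empty reference set always stores the first object are assumptions of this sketch.

```python
import math


def ib2_update(reference, x, label):
    """Store (x, label) only if the current reference set misclassifies x."""
    if reference:
        # Label of the nearest stored object.
        _, predicted = min(reference, key=lambda item: math.dist(item[0], x))
    else:
        predicted = None          # empty reference set: nothing to predict with
    if predicted != label:
        reference.append((x, label))
    return predicted
```

Objects that are already classified correctly are discarded, so the reference set grows mainly near class boundaries and after a change in the class descriptions.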

4 Ensemble Strategies for Changing Environments 

When and why would an ensemble be better than a single classifier in a changing 
environment? 

In massive data streams we are interested in simple models because there 
might not be time for running and updating an ensemble. On the other hand, 
Wang et al. argue that a simple ensemble might be easier to use than a decision 
tree as a single adaptive classifier [22].

When time is not of primary importance but very high accuracy is required, 
an ensemble would be the natural solution. An example is scanning mammo- 
grams for tissue lesions or cervical smear images for cell abnormalities. In these 
cases taking several minutes per image will be acceptable. 

Various online ensemble solutions have been proposed for changing environments.
We can group the approaches as follows:

— Dynamic combiners (or “horse racing” algorithms). In this group we put the
ensemble methods where the individual classifiers (experts) are trained in
advance and the changes in the environment are tracked by changes in the
combination rule.

— Updated training data. The algorithms in this group rely on using fresh data 
to make online updates of the team members. The combination rule may or 
may not change in the process. 
— Updating the ensemble members. Classifiers in the online ensemble can be
updated online or retrained in a batch mode if blocks of data are available. 

— Structural changes of the ensemble. “Replace the loser” is one possible strat- 
egy from this group. In the case of a change in the environment, the individ- 
ual classifiers are re-evaluated and the worst classifier is replaced by a new 
classifier trained on the most recent data. 

— Adding new features. The features will naturally change their importance 
along the life of the ensemble. There might be a need for incorporating new 
features without going through the loop of re-designing the entire ensemble 
system. 

Some of the approaches are detailed below. 

4.1 “Horse Racing” Ensemble Algorithms 

We will follow the horse racing analogy aptly used in [8] to introduce the Hedge β
algorithm and AdaBoost. Assume that you want to predict the outcome of
a horse race and you have L expert-friends whose predictions you can take or
ignore. Each x is a particular race and the class label is its outcome. You note
which of the experts have been wrong in their prediction and update a set of
weights to keep track of whose prediction is currently most trustworthy. The
most famous representative of this group is the Weighted Majority algorithm
by Littlestone and Warmuth [17], shown below.

1. Given is a classifier ensemble $\mathcal{D} = \{D_1, \ldots, D_L\}$.

Initialize all L weights as $w_i = 1$.

2. Operation. For a new x, calculate the support for each class $\omega_j$ as the sum
of the weights of all classifiers $D_i$ that suggest class label $\omega_j$ for x.

Label x to the class with the largest support.

3. Training. Observe the true label of x. Using a pre-defined $\beta \in [0, 1]$, update
the weights for all experts whose prediction was incorrect as $w_i \leftarrow \beta w_i$.

4. Continue from Step 2.
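
Steps 2 and 3 of the algorithm above can be sketched as two small functions. The names `wm_predict` and `wm_update` are hypothetical, and `votes` is assumed to hold the labels suggested by the L pre-trained classifiers for the current x.

```python
from collections import defaultdict


def wm_predict(weights, votes):
    """Step 2: label with the largest total weight among the classifiers' votes."""
    support = defaultdict(float)
    for w, vote in zip(weights, votes):
        support[vote] += w
    return max(support, key=support.get)


def wm_update(weights, votes, true_label, beta):
    """Step 3: multiply the weight of every wrong expert by beta."""
    return [w * beta if vote != true_label else w
            for w, vote in zip(weights, votes)]
```

Because weights are only ever decreased, an expert that was wrong for a long time needs the other experts to accumulate many errors before it can dominate again.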

Hedge (3 operates in the same way as the Weighted Majority algorithm but in- 
stead of taking the weighted majority, one classifier is selected from the ensemble 
and its prediction is taken as the ensemble decision. The classifier is selected ac- 
cording to the probability distribution defined by the normalized weights. 
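
The selection step of Hedge β can be sketched with a weighted random draw. The function names are hypothetical. `random.choices` normalizes the weights internally, so the weights need not sum to one.

```python
import random


def hedge_select(weights, rng):
    """Pick the index of one classifier with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def hedge_predict(weights, votes, rng):
    """The ensemble decision is the vote of the single selected classifier."""
    return votes[hedge_select(weights, rng)]
```

Passing a seeded `random.Random` instance as `rng` keeps repeated runs reproducible.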

Winnow is another algorithm under the horse race heading [16]. The algorithm
was originally designed for predicting the value of a (static) Boolean function,
called the target function, $f : \{0, 1\}^n \to \{0, 1\}$, by getting rid of the irrelevant
variables in the Boolean input. Translated into our ensemble terminology, Winnow
follows the Weighted Majority algorithm but uses a different updating Step 3.
The weights are reconsidered only if the ensemble gives a wrong prediction for
the current input x. If classifier $D_i$ gives the correct label for x, its weight is
increased as $w_i \leftarrow \alpha w_i$, where $\alpha > 1$ is a parameter of the algorithm. This is
called a promotion step, rewarding the expert not so much for predicting correctly
but for predicting correctly in times when the global algorithm should
have listened to them more carefully [3]. If classifier $D_i$ gives an incorrect label
for x, the demotion step takes place as $w_i \leftarrow w_i/\alpha$, ensuring that $D_i$ takes a
share of the blame for the ensemble error. Good results have been obtained for
$\alpha = 2$ [16]. Machine learning literature abounds in proofs of theoretical bounds
and further variants of Winnow and Weighted Majority [24].
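
The Winnow update differs from Weighted Majority in two ways: nothing changes unless the ensemble itself was wrong, and correct experts are promoted while wrong ones are demoted. A minimal sketch, with the hypothetical function name `winnow_update` and the default α = 2 mentioned above:

```python
def winnow_update(weights, votes, ensemble_label, true_label, alpha=2.0):
    """Promote correct experts and demote wrong ones, but only after an ensemble error."""
    if ensemble_label == true_label:
        return list(weights)                      # ensemble was right: no change
    return [w * alpha if vote == true_label else w / alpha
            for w, vote in zip(weights, votes)]
```

Since weights can grow as well as shrink, an expert that becomes relevant after a change can regain influence within a few ensemble errors.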

Figure 1(b) shows the average of 10 runs of the Hedge β, Weighted Majority
and Winnow algorithms for the artificial data described above. Here we
are “cheating” in a way because we suppose that the ensemble consists of three
perfect classifiers, one for each stage. For example, if classifier $D_1$ was always
picked to make the decision for objects 1 to 40, classifier $D_2$ for objects from 41
to 80, and classifier $D_3$ for objects from 81 to 120, the classification accuracy
would be 100% for all objects. As the plot shows, the Weighted Majority cannot
recover from changing the class description at object 40 even though the ensemble
consists of the three optimal experts. Hedge β would follow the same pattern
if applied in its standard form. Here we modified it a little. Instead of updating
all the weights for each x, we update only the weight of the classifier selected to
label x. Winnow is the undisputed winner among the three algorithms in the
plot. It quickly discovers the right classifier for each part. Obviously, involving
the ensemble accuracy in the weight updating step makes all the difference.

Blum [3] describes the above algorithms as “learning simple things really
well”. The problem with the horse racing approach, in the context discussed
here, is that the individual classifiers are not re-trained at any stage. Thus their
expertise may wear off and the ensemble may be left without a single adequate
expert for the new environment. Classifiers in the original algorithms are taken
from the whole (finite) set of possible classifiers for the problem, so there is no 
danger of lack of expertise. 

Mixture of experts type of training constitutes a special niche in the group of 
dynamic combiner methods. Originally developed as a strategy for training an 
ensemble of multi-layer perceptrons [12], the idea has a more generic standing as 
an on-line algorithm suitable for changing environments. The ensemble operates 
as an on-line dynamic classifier selection system and updates one classifier and 
the combination rule with each new example. Figure 2 shows a sketch of a train- 
ing cycle upon a presentation of a new data point x. The individual classifiers 
must be supplied with a forgetting mechanism so that outdated knowledge is
“unlearned”.

Good Old Combination Rules. Any combination rule can be applied for the 
online ensemble. Although there is a marked preference for majority vote [21] 
and weighted majority, there is no reason why we should stop here. Naive Bayes 
and BKS combination rules are explored in [15]. Both can be updated online. 

4.2 Updated Training Data for Online Ensembles 

[Fig. 2. A generic on-line ensemble construction method based on dynamic classifier
selection. For each x in the sequence, a probability distribution $p(D_i|x)$ is used to
choose a classifier $D_k$ to label x and the class label is output; if the label is
incorrect, $p(D_i|x)$ and $D_k$ are updated.]

Reusing the Data Points. Oza proposes a neat online bagging algorithm
which converges to batch bagging as the number of training examples and the
number of classifiers tend to infinity [19]. The idea is that the training samples
for the L classifiers in the ensemble can be created incrementally. The base
classifiers are trained using online (preferably lossless) classifier models. To form
the training sample for classifier $D_i$, Oza observes that each training data point
appears K times in a training set of size N sampled with replacement from the
available data set of size N. K is a binomial random variable, so the probability
that a certain z appears K = k times in the ith training set is

$$P(K = k) = \binom{N}{k} \left(\frac{1}{N}\right)^{k} \left(1 - \frac{1}{N}\right)^{N-k}.$$

When N is large the probability of selecting a particular data point is small, and
the binomial distribution can be approximated by a Poisson distribution with
$\lambda = 1$. Therefore the probability is calculated as $P(K = k) = e^{-1}/k!$.

The probabilities for k = 0, ..., 7 are tabulated below. In practice it is very
unlikely that any z will be present in a sample more than 7 times.

k          0       1       2       3       4       5       6       7
P(K = k)   0.3679  0.3679  0.1839  0.0613  0.0153  0.0031  0.0005  0.0001
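
The tabulated values can be checked numerically. The following is a minimal sketch, not part of the original paper; the function names and the data set size n = 1000 are assumptions made only for illustration. It compares the exact binomial probability with its Poisson approximation for $\lambda = 1$.

```python
import math

def binomial_pmf(k, n):
    """P(K = k) when each of n draws picks a given point with probability 1/n."""
    p = 1.0 / n
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam=1.0):
    """Poisson probability with rate lam; lam = 1 is the case used for bagging."""
    return math.exp(-lam) * lam**k / math.factorial(k)

if __name__ == "__main__":
    n = 1000  # assumed data set size, for illustration only
    for k in range(8):
        print(f"k={k}  binomial={binomial_pmf(k, n):.4f}  poisson={poisson_pmf(k):.4f}")
```

For a data set of this size the two columns already agree to about three decimal places, which is why sampling k from the Poisson distribution is a reasonable substitute when N is not known in advance.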



The online bagging algorithm proposed by Oza is given below.

1. Pick the number of classifiers in the ensemble, L.

2. Training. For each new labeled data point x, for each classifier $D_i$,
   i = 1, ..., L, sample from the Poisson distribution explained above to find k.
   Put k copies of x in the training set of $D_i$. Retrain $D_i$ using an online
   training algorithm.

3. Operation. Take the majority vote of the ensemble to label x.

4. Continue from Step 2.
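
The steps above can be sketched in code. This is an illustrative sketch under stated assumptions, not Oza's implementation: the base learner `OnlineNearestMean` (one running mean per class) is a hypothetical stand-in for any lossless online classifier, and the class and method names are invented for the example. Presenting a point k times to the base learner plays the role of putting k copies of it in the training set.

```python
import math
import random
from collections import Counter

def sample_poisson(rng, lam=1.0):
    """Draw k ~ Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

class OnlineNearestMean:
    """Hypothetical online base classifier: keeps one running mean per class."""
    def __init__(self):
        self.sums = {}    # label -> per-feature sums
        self.counts = {}  # label -> number of points seen

    def update(self, x, y):
        s = self.sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            s[j] += v
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        def sq_dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[label]))
        return min(self.counts, key=sq_dist)

class OnlineBagging:
    def __init__(self, n_classifiers, seed=0):
        self.rng = random.Random(seed)
        self.members = [OnlineNearestMean() for _ in range(n_classifiers)]

    def train(self, x, y):
        # Step 2: each member sees the new point k ~ Poisson(1) times.
        for member in self.members:
            for _ in range(sample_poisson(self.rng)):
                member.update(x, y)

    def predict(self, x):
        # Step 3: majority vote over the members that have seen any data.
        votes = Counter(m.predict(x) for m in self.members)
        votes.pop(None, None)
        return votes.most_common(1)[0][0] if votes else None
```

A typical use is to call `train(x, y)` for every labeled point as it arrives and `predict(x)` whenever a label is needed. No forgetting mechanism is included here, so the sketch applies to a stationary environment only.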

The online boosting algorithm proposed next by Oza [19] draws upon a
similar idea to his online bagging. The number of times that a data point appears
in the training set of classifier $D_i$ is again guided by the Poisson distribution.
However, the parameter of the distribution, $\lambda$, will change from one classifier to
the next. We start with $\lambda = 1$ to add copies of x to the first sample and train
classifier $D_1$ on it. If $D_1$ misclassifies x, $\lambda$ increases so that x is likely to feature
more often in the training set of $D_2$. The subsequent values of $\lambda$ are updated in the same
manner until all L classifiers are retrained.

Filtering. Breiman suggests forming the training sets for the consecutive
classifiers as the data flows through the system, a method called Pasting small votes [4].
The first N data points are taken as the training set for $D_1$. The points coming
next are filtered so that the next classifier is trained on a sample such that the
ensemble built so far (just $D_1$ for now) will misclassify approximately half of
it. To form the sample for classifier $D_k$, run each new data point through the
current ensemble. If misclassified, add it to the new training set. If correctly
classified, add it to the training set with probability $e_k/(1 - e_k)$, where $e_k$ is an estimate
of the error of the ensemble on the training set for classifier $D_k$. With this
probability the expected numbers of misclassified and correctly classified points
in the sample are equal. To calculate this estimate, Breiman suggests using the
smoothed version

$$e_k = 0.75 \times e_{k-1} + 0.25 \times r_k,$$

where $e_{k-1}$ is the generalization ensemble error¹ up to classifier $D_{k-1}$ and $r_k$ is
the ensemble error found during building the training set for classifier $D_k$. The
smoothing is needed because if N is small, then the estimate $r_k$ will be noisy.
In this way, L classifiers are subsequently trained on diverse data sets.
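
The filtering step can be sketched as follows. This is a hedged illustration rather than Breiman's code: the function names, the `(x, y)` stream format and the `ensemble_predict` callable are assumptions introduced for the example, and the acceptance probability is capped at 1 for error estimates of 0.5 or more.

```python
import random

def smoothed_error(prev_estimate, new_rate):
    """e_k = 0.75 * e_{k-1} + 0.25 * r_k, the smoothing quoted in the text."""
    return 0.75 * prev_estimate + 0.25 * new_rate

def build_filtered_sample(stream, ensemble_predict, error_estimate, size, rng):
    """Collect `size` points for the next classifier from an iterator of (x, y).

    Misclassified points are always kept; correctly classified points are kept
    with probability e / (1 - e). Returns the sample and the error rate r_k
    observed on the points inspected while building it.
    """
    if error_estimate >= 0.5:
        keep_prob = 1.0
    else:
        keep_prob = error_estimate / (1.0 - error_estimate)
    sample, seen, wrong = [], 0, 0
    for x, y in stream:
        seen += 1
        if ensemble_predict(x) != y:
            wrong += 1
            sample.append((x, y))
        elif rng.random() < keep_prob:
            sample.append((x, y))
        if len(sample) == size:
            break
    return sample, (wrong / seen if seen else 0.0)
```

With an ensemble error of 0.2, for example, correctly classified points are kept with probability 0.25, so about half of the collected sample consists of points the current ensemble gets wrong.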

Breiman proposes to use $e_k$ as a stopping criterion. In a static environment,
the ensemble error will comfortably level off at some L and no more
ensemble members will be needed. If, however, the environment changes, $e_k$ will
start increasing again. Changes in the ensemble will be needed at this point.
We can, for example, replace ensemble members selected through a certain cri- 
terion. Alternatively, we can keep an ongoing process of building classifiers for 
the ensemble and “forget” the oldest member each time a new member is added, 
thus keeping the ensemble size constant at L. 

The methods in this subsection have been proposed as variants of the corre- 
sponding batch methods for stationary environments. When faced with changing 
environments, we have to introduce a forgetting mechanism. One possible solu- 
tion would be to use a window of past examples, preferably of variable size, as 
discussed earlier. 

Using Data Blocks or Chunks. In this model we assume that the data come 
as blocks of data points at a time. The ensemble can be updated using batch 
mode training on a “chunk” of data [9,21,22]. The blocks can be treated as single 
items of data in the sense that we may train the ensemble on the most recent 
block, on a set of past blocks or on the whole set of blocks. A forgetting strategy 
needs to be incorporated in the model, e.g., a window of blocks. Ganti et al. [9]
propose a change detection algorithm based on the difference between blocks of data.
They also consider selective choices of past blocks so that a certain pattern is
modelled. For example, if the blocks come once a day and contain all the data
for that day, it might be interesting to build an ensemble for Wednesday’s
classification problem and apply it on subsequent Wednesdays.

Figure 3 gives an illustration of the three data handling approaches. 

¹ This error may be measured on an independent testing set or on the so-called
“out-of-bag” data points. For details see [4].

[Figure 3 panels: “Re-use data points (online bagging and boosting)” and “Chunks
of data (block evolution)”, each mapping the data stream to the training data sets
for the classifiers in the ensemble.]

Fig. 3. Illustration of the three data handling approaches for building online classifier
ensembles.

4.3 Changing the Ensemble Structure 

The simplest strategy here can be named replace the oldest. We remove the 
oldest classifier in the ensemble and train a new classifier to take its place. 
The three methods of data handling described above can be run within this 
framework. 

We may adopt a more sophisticated strategy for pruning of the ensemble. 
Wang et al. [22] propose to evaluate all classifiers using the most recent “chunk” 
of data as the testing set. The classifiers whose error exceeds a certain thresh- 
old are discarded from the ensemble. We call this strategy replace the loser. 
Street and Kim [21] consider a “quality score” for replacing a classifier in the 
ensemble based on its merit to the ensemble, not only on the basis of its in- 
dividual accuracy. This quality score is a soft version of the Winnow approach 
for updating the weights of the classifiers. Suppose we are testing the ensemble 
members on a new data set. We start with equal weights for all ensemble members.
For each x, let $P_1$ be the proportion of votes for the majority class, $P_2$ be
the proportion of votes for the second most voted-for class, $P_c$ be the proportion
for the correct class of x, and $P_i$ be the proportion for the class suggested
by classifier $D_i$. The score for $D_i$ is updated as $w_i \leftarrow w_i + \delta(x)$, where $\delta(x)$ is

    $1 - |P_1 - P_2|$, if both the ensemble and $D_i$ are correct;
    $1 - |P_1 - P_c|$, if the ensemble is wrong and $D_i$ is correct;
    $1 - |P_i - P_c|$, if $D_i$ is wrong (regardless of the ensemble).

Most of the methodologies for building ensembles considered hitherto have
built-in mechanisms for controlling the ensemble size. Alternatively, the
ensemble size can be left as a parameter of the algorithm or a designer’s choice.

5 Conclusions 

Knowing that ensemble methods are accurate, flexible and sometimes more
efficient than single classifiers, the purpose of this study was to explore classifier
ensembles within changing classification environments. “Concept drift” has long
been a theme in machine learning. Ensemble methods have recently been
explored in this line of work. The frontier of advanced research in multiple classifier
systems which we are reaching now is likely to bring fresh solutions to the
changing environment problem.

The ensemble can be perceived as a living population - expanding, shrinking, 
replacing and retraining classifiers, taking on new features, forgetting outdated 
knowledge. Ideas can be borrowed from evolutionary computation and artificial 
life. Even if we are still a long way from putting together a toolbox and a
theoretical foundation, the road ahead is fascinating.

Abstract. A generic fusion problem is studied for multiple sensors whose
outputs are probabilistically related to their inputs according to unknown
distributions. Sensor measurements are provided as iid input-output samples,
and an empirical risk minimization method is described for designing
fusers with distribution-free performance bounds. The special cases
of isolation and projective fusers for classifiers and function estimators,
respectively, are described in terms of performance bounds. The isolation
fusers for classifiers are probabilistically guaranteed to perform at least
as well as the best classifier. The projective fusers for function estimators
are probabilistically guaranteed to perform at least as well as the
best subset of estimators.



1 Introduction 

Information fusion problems have been addressed for centuries in various
disciplines, such as political economy, reliability, pattern recognition, forecasting,
and distributed detection. In multiple sensor systems, fusion problems arise
naturally when overlapping regions are covered by the sensors. Often, the
individual sensors can themselves be complex, consisting of sophisticated sensor
hardware and software. Consequently, sensor outputs can be related to the
actual object features in a complicated manner, and these relationships are often
characterized by probability distributions. Early information fusion methods
required statistical independence of sensor errors, which greatly simplified the
fuser design; for example, a weighted majority rule suffices in detection problems.
Such solutions are not applicable to current multiple sensor systems, since
the sensor measurements can be highly correlated and consequently violate the
statistical independence property. Another classical approach to fuser design is
the Bayesian method that minimizes a suitable expected risk, which relies on
analytical expressions for sensor distributions. Deriving the required closed-form
sensor distributions is very difficult since it often requires knowledge of areas
such as device physics, electrical engineering, and statistical modeling. Particularly
when only a finite number of measurements are available, the selection
of a fuser from a carefully chosen function class is easier, in a fundamental
information-theoretic sense, than inferring completely unknown sensor
distributions [21].

In operational sensor systems, measurements are collected by sensing objects
and environments with known parameters. Thus fusion methods that utilize such
empirical observational or experimental data will be of high practical relevance.
In this paper, we present a brief overview of rigorous approaches for designing
such fusers based on empirical process theory [21] to provide performance
guarantees based on finite samples. We briefly describe a general fuser design
approach and illustrate it using a vector space method. A more detailed account
of the generic sensor fusion problem can be found in [18]. The problem of
combining outputs of multiple classifiers is a special case of the generic sensor fusion
problem, wherein the training sample corresponds to the measurements. We
describe the isolation fuser methods for classifiers to probabilistically ensure that
the fuser’s performance guarantees are at least as good as those of the best classifier.
The fusion of function estimators is another special case of the sensor fusion
problem which is of practical utility. We then describe the nearest-neighbor
projective fuser for function estimators that performs at least as well as the best
projective combination of the estimators. Both isolation and projective fusers
were originally developed for the generic sensor fusion problem, and we
sharpen the general performance results for these special cases.

This paper presents a brief account of results from other papers. We describe 
the classical sensor fusion methods in Section 2. We present a generic sensor 
fusion problem and a solution using empirical risk minimization in Section 3. We 
describe the problem of fusing classifiers and function estimators in Sections 4 
and 5, respectively. The original notations from the respective areas are retained 
in the individual sections; while it results in a non-uniform notation, it makes it 
easier to relate these results to the individual areas. 

2 Classical Fusion Problems 

Fusion methods for multiple sources to achieve performance exceeding that
of individual sources were studied in political economy models in 1786
and composite methods in 1818. In the twentieth century, fusion methods were
applied in a wide spectrum of areas such as reliability, forecasting, pattern
recognition, neural networks, decision fusion, and statistical estimation. A brief
overview of early information fusion works can be found in [7]. The problem
of fusing classifiers is relatively new and was first addressed in a probabilistic
framework by Chow [1] in 1965.

When sensor distributions are known, several fusion rule estimation problems
have been solved under various formulations. A simpler version of this problem
is the Condorcet jury model (see [4] for an overview), where a majority rule
can be used to combine 1-0 probabilistically independent decisions of a group of
N members. If each member has probability p of making a correct decision, the
probability that the majority makes the correct decision is

$$p_N = \sum_{i=N/2}^{N} \binom{N}{i} p^{i} (1-p)^{N-i}.$$

Then we have an interesting dichotomy: (a) if p > 0.5, then $p_N > p$ and
$p_N \to 1$ as $N \to \infty$; and (b) if p < 0.5, then $p_N < p$ and $p_N \to 0$ as $N \to \infty$.
For the boundary case p = 0.5 we have $p_N = 0.5$. Interestingly, this result
was rediscovered by von Neumann in 1959 in building reliable computing
devices using unreliable components by taking a majority vote of duplicated
components. For multiple classifiers, a weighted majority fuser is optimal [1]
under statistical independence, and the fuser weights can be derived in closed
form using the classifier detection probabilities. Over the past few years, multiple
classifier systems have attracted extensive interest and growth [5,23].
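
The dichotomy of the jury model is easy to observe numerically. The sketch below is illustrative only; the function name is hypothetical, and it assumes an odd number of voters with a strict majority (more than N/2 correct votes) so that ties cannot occur.

```python
import math

def majority_correct_probability(p, n):
    """Probability that more than half of n independent voters are correct.

    Each voter is correct with probability p; n is assumed odd to rule out ties.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
               for i in range(n // 2 + 1, n + 1))

if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6):
        values = [majority_correct_probability(p, n) for n in (1, 11, 101)]
        print(p, [round(v, 4) for v in values])
```

For p above 0.5 the printed values grow with the group size, for p below 0.5 they shrink, and for p = 0.5 they stay at 0.5.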

The distributed detection problem [22] studied extensively in the target track- 
ing area can be viewed as a generalization of the above two problems. The 
Boolean decisions from a system of detectors are combined by minimizing a 
suitably formulated Bayesian risk function. The risk function is derived from 
the detector densities and the minimization is typically carried out using
analytical or deterministic optimization methods. In particular, the risk function
used for classifier fusion in [1] corresponds to the misclassification probability
and its minimum is achieved by the weighted majority rule. In these works, the
sensor distributions are assumed to be known, which is reasonable in their do- 
mains. While several of these solutions can be converted into sample-based ones 
[9], these are not primarily designed for measurements. As evidenced in prac- 
tical multiple sensor systems and classifiers, it is more pragmatic to have the 
measurements rather than the error distributions. 

3 A Generic Sensor Fusion Problem 

We consider a multiple sensor system of N sensors, where sensor $S_i$, i = 1, 2, ..., N,
outputs $Y^{(i)} \in \Re^d$ corresponding to input $X \in \Re^d$ according to distribution
$P_{Y^{(i)}|X}$. Intuitively, input X is the “measured” quantity such as presence of a
target or a value of feature vector. The expected error of sensor $S_i$ is defined as

$$I(S_i) = \int C(X, Y^{(i)}) \, dP_{Y^{(i)},X},$$

where $C : \Re^d \times \Re^d \mapsto \Re$ is the cost function. Here, $I(S_i)$ is a measure of how good
sensor $S_i$ is in “measuring” input feature X. If $S_i$ is a detector [22] or classifier
[2], we can have $X \in \{0,1\}$ and $Y^{(i)} \in \{0,1\}$, where X = 1 (0) corresponds
to presence (absence) of a target. Then $I(S_i) = \int [X \oplus Y^{(i)}] \, dP_{Y^{(i)},X}$ is the
probability of misclassification (false alarm and missed detection) of $S_i$, where
$\oplus$ is the exclusive-OR operation¹.

The measurement error corresponds to the randomness in measuring a particular
value of feature X, which is distributed according to $P_{Y^{(i)}|X}$. The
systematic error at X corresponds to $E[C(X, Y^{(i)})|X]$, which must be 0 in the case
of a perfect sensor. This error is often referred to as the bias error.

We consider a fuser $f : \Re^{Nd} \mapsto \Re^d$ that combines the outputs of sensors
$Y = (Y^{(1)}, Y^{(2)}, \ldots, Y^{(N)})$ to produce the fused output f(Y). We define the
expected error of the fuser f to be

$$I_F(f) = \int C(X, f(Y)) \, dP_{Y,X}.$$

The objective of fusion is to achieve low
values of $I_F(f)$, and for this both systematic and measurement errors must be
taken into account. The fuser is typically chosen from a family of fusion rules
$\mathcal{F} = \{f : \Re^{Nd} \mapsto \Re^d\}$ which could be either explicitly or implicitly identified.
The expected best fusion rule $f^*$ satisfies $I_F(f^*) = \min_{f \in \mathcal{F}} I_F(f)$. For example, if $\mathcal{F}$
is a set of sigmoidal neural networks obtained by varying the weight vector for
a fixed architecture, then $f^* = f_{w^*}$ corresponds to the weight vector $w^*$ that
minimizes $I_F(\cdot)$ over all weight vectors.

¹ Alternatively, X can be expanded to include the usual “feature” vector and C(.) can
be redefined so that $I(S_i)$ is the misclassification probability.

In this formulation, since $I_F(\cdot)$ depends on $P_{Y,X}$, $f^*$ cannot be computed
even in principle if the distribution is not known. We consider that only an
independently and identically distributed (iid) l-sample $(X_1,Y_1), (X_2,Y_2), \ldots,
(X_l,Y_l)$ is given, where $Y_i = (Y_i^{(1)}, Y_i^{(2)}, \ldots, Y_i^{(N)})$ and $Y_i^{(j)}$ is the output of $S_j$
in response to input $X_i$. Our goal is to obtain an estimator $\hat{f}$, based only on a
sufficiently large sample, such that

$$P^l_{Y,X}\left[ I_F(\hat{f}) - I_F(f^*) > \epsilon \right] < \delta, \qquad (1)$$

where $\epsilon > 0$ and $0 < \delta < 1$, and $P^l_{Y,X}$ is the distribution of iid l-samples. As per
this condition the “error” of $\hat{f}$ is within $\epsilon$ of the optimal error (of $f^*$) with probability
$1 - \delta$, irrespective of the sensor distributions. Since $\hat{f}$ is to be “chosen” from a
potentially infinite set, namely $\mathcal{F}$, based only on a finite sample, this condition
is a reasonable target. Strictly stronger conditions are generally not possible to
achieve. For example, consider the condition $P^l_{Y,X}[I_F(\hat{f}) > \epsilon] < \delta$ for the
case of classifiers $\mathcal{F} = \{f : [0,1]^N \mapsto \{0,1\}\}$. This condition cannot be satisfied,
since for any classifier $\hat{f} \in \mathcal{F}$, there exists a distribution for which $I_F(\hat{f}) > 1/2 - \rho$
for any $\rho \in [0,1]$ (see Theorem 7.1 of [2] for details).

Consider a simple two-sensor system such that $Y^{(1)} = a_1 X + Z$, where Z is
normally distributed with zero mean, and is independent of X, i.e., a constant
scaling error and a random additive error. For the second sensor, we have $Y^{(2)} =
a_2 X + b_2$, which has a scaling and bias error. Let X be uniformly distributed
over [0,1], and $C[X,Y] = (X - Y)^2$. Then, we have $I(S_1) = (1 - a_1)^2$ and
$I(S_2) = (1 - a_2 - b_2)^2$, which are nonzero in general. For a suitably designed
fuser f we have $I_F(f) = 0$, since the bias $b_2$ is subtracted from $Y^{(2)}$ and the multipliers
cancel the scaling error. Such a fuser can be designed only with a significant insight
into sensors, in particular with a detailed knowledge about the distributions. To
illustrate the effects of finite samples, we generate three values for X given by
{0.1, 0.5, 0.9} with corresponding Z values given by {0.1, -0.1, -0.3}. The corresponding
values for $Y^{(1)}$ and $Y^{(2)}$ are given by $\{0.1a_1 + 0.1, 0.5a_1 - 0.1, 0.9a_1 - 0.3\}$
and $\{0.1a_2 + b_2, 0.5a_2 + b_2, 0.9a_2 + b_2\}$ respectively. Consider a linear fuser
$f(Y^{(1)}, Y^{(2)}) = w_1 Y^{(1)} + w_2 Y^{(2)} + w_3$. The following weights enable the fuser
outputs to exactly match X values for each measurement:



Wl 



1 

0.2-0.4ai 



W2 






O.lai + 0.1 
0.4ai + 0.1 



0.102 + &2 
0.4o2 



While these weights achieve zero error on the measurements, they do not achieve
zero value for $I_F$ (even though a fuser with zero expected error exists and can be
computed if the sensor distributions are given). The idea behind the criterion in
Eq. (1) is to achieve performance close to optimal using only a sample. To achieve
this, a suitable $\mathcal{F}$ is selected first, from which a fuser is chosen to achieve small
error on a sufficiently large sample, as will be illustrated subsequently.
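
The gap between zero error on the measurements and zero expected error can be illustrated with a small simulation. Everything specific in this sketch is an assumption made for illustration: the parameter values `A1`, `A2`, `B2`, the noise standard deviation, and the two fusers. `interpolating_fuser` is just one linear fuser (with $w_2 = 0$) that reproduces X on the three measurements; it is not claimed to be the set of weights given in the paper. `debiasing_fuser` uses the parameters of the second sensor, which would normally be unknown.

```python
import random

# Assumed sensor parameters and noise level, chosen only for illustration.
A1, A2, B2 = 1.0, 2.0, 0.5
NOISE_STD = 0.2

def sensors(x, z):
    """Outputs of the two sensors for input x and additive noise z."""
    return A1 * x + z, A2 * x + B2

def interpolating_fuser(y1, y2):
    """One linear fuser (w2 = 0) that reproduces X on the three measurements."""
    w1 = 0.4 / (0.4 * A1 - 0.2)
    w3 = 0.1 - w1 * (0.1 * A1 + 0.1)
    return w1 * y1 + w3

def debiasing_fuser(y1, y2):
    """Subtracts the bias and cancels the scaling of the second sensor."""
    return (y2 - B2) / A2

def mean_squared_error(fuser, pairs):
    """Average of (x - fuser output)^2 over (x, z) pairs."""
    return sum((x - fuser(*sensors(x, z))) ** 2 for x, z in pairs) / len(pairs)

measurements = list(zip([0.1, 0.5, 0.9], [0.1, -0.1, -0.3]))
rng = random.Random(0)
fresh = [(rng.random(), rng.gauss(0.0, NOISE_STD)) for _ in range(20000)]

for name, fuser in [("interpolating", interpolating_fuser),
                    ("debiasing", debiasing_fuser)]:
    print(name, mean_squared_error(fuser, measurements),
          mean_squared_error(fuser, fresh))
```

Both fusers fit the three measurements essentially exactly, but only the one built with knowledge of the sensor keeps its error near zero on fresh inputs; the interpolating fuser has latched onto the particular noise values in the sample.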

Due to the generic nature of the sensor fusion problem described here, it is 
related to a number of similar problems in a wide variety of areas. A detailed 
discussion of these aspects can be found in [18]. 



3.1 Empirical Risk Minimization 

Consider that the empirical error estimate

$$\hat{I}_F(f) = \frac{1}{l} \sum_{i=1}^{l} C(X_i, f(Y_i))$$

is minimized by $\hat{f} \in \mathcal{F}$. Such a method corresponds to an ad hoc approach of
choosing a class of fusers such as neural networks or linear fusers, and choosing
a particular fuser to minimize the error within the class. Performance of such a
method, including the basic feasibility, depends on the fuser class and the
complexity of minimizing the empirical error. For example, if $\mathcal{F}$ has finite capacity
[21], then under bounded error, or bounded relative error, for a sufficiently large
sample we have $P^l_{Y,X}[I_F(\hat{f}) - I_F(f^*) > \epsilon] < \delta$ for arbitrarily specified $\epsilon > 0$
and $\delta$, $0 < \delta < 1$. Typically, the required sample size is expressed in terms of $\epsilon$
and $\delta$ and the parameters of $\mathcal{F}$. The most general result [13] that ensures this
condition is based on the scale-sensitive dimension, which establishes the basic
tractability of this problem. But this general method often results in very loose
bounds for the sample size, and tighter estimates are possible by utilizing specific
properties of $\mathcal{F}$.

If $\mathcal{F}$ is a vector space of dimensionality $d_V$, we have the following results [12]:
(a) the sample size is a simple function of $d_V$, (b) $\hat{f}$ can be computed using
least squares methods in polynomial time, and (c) no smoothness conditions are
required on the functions or distributions. For simplicity consider that $X \in [0,1]$
and $Y \in [0,1]^N$. Let $f^*$ and $\hat{f}$ denote the expected best and empirical best fusion
functions chosen from a vector space $\mathcal{F}$ of dimension $d_V$ and range [0, 1]. Given
an iid sample of sufficiently large size (the explicit expression is given in [12]),
we have $P^l_{Y,X}[I_F(\hat{f}) - I_F(f^*) > \epsilon] < \delta$. This method subsumes
two very important cases [12]:

(a) Potential Functions: The potential functions, where $f_i(y)$ is of the form
    $\exp((y - \alpha)^2/\beta)$ for suitably chosen constants $\alpha$ and $\beta$, constitute an
    example of the vector space method.

(b) Special Neural Networks: In two-layer sigmoidal networks of [6], the unknown
    weights are only in the output layer, which enables us to express each network
    in the form $\sum_{i=1}^{d_V} a_i \eta_i(y)$ with universal $\eta_i(\cdot)$’s.


Similar sample size estimates have been derived for fusers based on feedforward
neural networks in [10]. Also, non-linear statistical estimators can be
employed to estimate the fuser based on the sample, such as the Nadaraya-Watson
estimator [12]. The main limitation of the empirical risk minimization approach is
that $\hat{f}$ is only guaranteed to be close to $f^*$, but there are no guarantees that the
latter is any good. While it is generally true that if $\mathcal{F}$ is large enough, $f^*$ would
perform better than the best sensor, it is indeed possible that it performs worse than
the worst sensor. Systematic approaches such as isolation fusers [17] and projective
fusers [15] are useful to ensure the fuser performance. We will subsequently
discuss the special cases of isolation and projective fusers for classifiers [11] and
function estimators [19], respectively. We note that projective fusers have also
been applied to classifiers [11] and isolation fusers have also been applied to
function estimators [16].



3.2 Example 

We consider 5 classifiers such that $Y \in \{0,1\}^5$ and $X \in \{0,1\}$ corresponds
to the “correct” class, which is generated with equal probabilities, i.e., P(X = 0) =
P(X = 1) = 1/2 [20]. The error of classifier $C_i$, i = 1, 2, ..., 5, is described as
follows: the output is the correct decision with probability 1 - i/10, and is the
opposite with probability i/10. The task is to combine the outputs of the classifiers to
predict the correct class. The percentage error of the individual classifiers and the
fused system based on the Nadaraya-Watson estimator is presented in Table 1.
Note that the fuser is consistently better than the best classifier $C_1$ beyond the
sample size of 1000. The performance results of the Nadaraya-Watson estimator,
empirical decision rule, nearest neighbor rule, and Bayesian rule based on the
analytical formulas are presented in Table 2. The Bayesian rule is computed
based on the formulas used in the data generation and is provided for comparison
only.

Table 1. Percentage error of Nadaraya-Watson estimator and individual classifiers.

Sample size  Test set  C1    C2    C3    C4    C5    Nadaraya-Watson
100          100       7.0   20.0  33.0  35.0  55.0  12.0
1000         1000      11.3  18.5  29.8  38.7  51.6  10.6
10000        10000     9.5   20.1  30.3  39.8  49.6  8.58
50000        50000     10.0  20.1  29.8  39.9  50.1  8.860

Table 2. Correct classification percentage of fusers.

Sample size  Test size  Bayesian fuser  Empirical decision  Nearest neighbor  Nadaraya-Watson
100          100        91.91           23.00               82.83             88.00
1000         1000       91.99           82.58               90.39             89.40
10000        10000      91.11           90.15               90.81             91.42
50000        50000      91.19           90.99               91.13             91.14
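
The error of the Bayesian rule for this example can also be obtained by direct enumeration of the 32 possible output patterns. The sketch below is an illustration rather than the authors' procedure; it assumes that the classifier errors are statistically independent given the class, which the description above suggests but does not state explicitly, and the function names are hypothetical.

```python
from itertools import product

# Classifier C_i errs with probability i/10, i = 1, ..., 5.
ERROR_RATES = [i / 10 for i in range(1, 6)]

def likelihood(outputs, true_class):
    """P(classifier outputs | X = true_class), assuming independent errors."""
    prob = 1.0
    for out, err in zip(outputs, ERROR_RATES):
        prob *= (1.0 - err) if out == true_class else err
    return prob

def bayes_error():
    """Error of the rule that picks the more probable class for each pattern."""
    total = 0.0
    for outputs in product((0, 1), repeat=len(ERROR_RATES)):
        joint0 = 0.5 * likelihood(outputs, 0)  # P(outputs, X = 0)
        joint1 = 0.5 * likelihood(outputs, 1)  # P(outputs, X = 1)
        total += min(joint0, joint1)
    return total

if __name__ == "__main__":
    print(f"Bayes error: {bayes_error():.4f}")
```

Under the independence assumption the enumeration gives an error of 0.088, that is, about 91.2% correct classification, which is in line with the Bayesian fuser column of Table 2 for the larger samples.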



4 Isolation Fusers for Classifiers 



Over the past decades several methods, such as nearest neighbor rules, neural 
networks, tree methods, and kernel rules, have been developed for designing clas- 
sifiers. Often, the classifiers are quite varied and their performances are charac- 
terized by various smoothness and/or combinatorial parameters [2]. The designer 
is thus faced with a wide variety of choices which are not easily comparable. It 
is generally known that a good fuser outperforms the best classifier, and at the 
same time, a bad fuser choice can result in a performance worse than the worst 
classifier. Thus it is very important to employ fusion methods that provide con- 
crete performance guarantees - in particular, for the fuser to be reasonable it 
must perform at least as well as the best classifier. 

We are given an independently and identically distributed (iid) sample
$(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$, according to an unknown distribution $P_{X,Y}$,
where $X_i \in \Re^d$ and $Y_i \in \{0,1\}$. The problem is to design a classifier
$\phi : \Re^d \mapsto \{0,1\}$ based on the sample that ensures a small value for the
probability of misclassification

$$L(\phi) = \int I_{\{\phi(X) \neq Y\}} \, dP_{X,Y},$$

where $I_D(x)$ is the indicator function of the set D such that $I_D(x) = 1$
if $x \in D$ and $I_D(x) = 0$ otherwise. We often suppress the operand x when it is
clear from the context.

For $\phi \in \mathcal{H}$, the empirical error of misclassification is given by

$$\hat{L}(\phi) = \frac{1}{n} \sum_{i=1}^{n} I_{\{\phi(X_i) \neq Y_i\}}.$$

Let $\hat{\phi}$ minimize $\hat{L}(\cdot)$ over $\mathcal{H}$. If $\mathcal{H}$ has finite Vapnik-Chervonenkis dimension $V_{\mathcal{H}}$,
we have [2]

$$P^n_{X,Y}\left[ L(\hat{\phi}) - \min_{\phi \in \mathcal{H}} L(\phi) > \epsilon \right] < \delta$$

for sufficiently large n, irrespective of the distribution $P_{X,Y}$. We are given N
such classifiers corresponding to the classes $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_N$ such that

$$P^n_{X,Y}\left[ L(\hat{\phi}_i) - \min_{\phi \in \mathcal{H}_i} L(\phi) > \epsilon \right] < \delta_i$$

where $\hat{\phi}_i$ minimizes $\hat{L}(\cdot)$ over $\mathcal{H}_i$. Our objective is to “fuse” the classifier outputs
so that the fused system performs at least as well as the best individual classifier
based on the sample only. We next describe a method based on the isolation
property that enables us to compare the fused system with the best individual
classifier [11]. This method is simple to apply and requires easily satisfiable
criteria.



4.1 Single Classifier 

The lowest possible error achievable by any deterministic classifier is given by
the Bayes error $L(\phi^*)$, where $\phi^* : \Re^d \mapsto \{0,1\}$ is defined as

$$\phi^*(x) = \begin{cases} 1 & \text{if } P_{X,Y}[Y = 1 | X = x] > P_{X,Y}[Y = 0 | X = x] \\ 0 & \text{otherwise.} \end{cases}$$

Since the distribution is not known, $\phi^*$ cannot be computed. The performance
of $\hat{\phi}$ that minimizes $\hat{L}(\cdot)$ can be characterized using the properties of $\mathcal{H}$.

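Let $\mathcal{A}$ be a collection of measurable sets of $\Re^d$. For $(z_1, z_2, \ldots, z_n) \in \{\Re^d\}^n$,
let $N_{\mathcal{A}}(z_1, z_2, \ldots, z_n)$ denote the number of different sets in
$\{\{z_1, z_2, \ldots, z_n\} \cap A : A \in \mathcal{A}\}$. The nth shatter coefficient of $\mathcal{A}$ is

$$s(\mathcal{A}, n) = \max_{(z_1, z_2, \ldots, z_n) \in \{\Re^d\}^n} N_{\mathcal{A}}(z_1, z_2, \ldots, z_n).$$

Then, the Vapnik-Chervonenkis (VC) dimension of $\mathcal{A}$, denoted by $V_{\mathcal{A}}$, is the
largest integer $k \geq 1$ such that $s(\mathcal{A}, k) = 2^k$. The following important identity
[21] relates the shatter coefficient to the VC dimension:

$$s(\mathcal{A}, n) \begin{cases} = 2^n & \text{if } n \leq V_{\mathcal{A}} \\ < 2^n & \text{if } n > V_{\mathcal{A}}. \end{cases}$$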
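
These definitions can be made concrete for two simple collections of sets on the real line. The sketch below is illustrative only and the function names are hypothetical. It counts $N_{\mathcal{A}}(z_1, \ldots, z_n)$ for half-lines $\{z : z \leq a\}$ and for closed intervals $\{z : a \leq z \leq b\}$; for these two collections every set of n distinct points gives the same count, so the count equals the shatter coefficient.

```python
def half_line_traces(points):
    """Count distinct subsets of `points` of the form {z : z <= a}, a real."""
    pts = sorted(set(points))
    # One threshold below all points, then one at each point, covers every case.
    thresholds = [pts[0] - 1.0] + pts
    return len({tuple(z <= a for z in points) for a in thresholds})

def interval_traces(points):
    """Count distinct subsets of `points` of the form {z : a <= z <= b}."""
    pts = sorted(set(points))
    traces = {tuple(False for _ in points)}  # the empty intersection
    for i, a in enumerate(pts):
        for b in pts[i:]:
            traces.add(tuple(a <= z <= b for z in points))
    return len(traces)

if __name__ == "__main__":
    for n in range(1, 5):
        pts = [float(j) for j in range(n)]
        print(n, 2**n, half_line_traces(pts), interval_traces(pts))
```

Half-lines pick out n + 1 subsets of n points, which equals $2^n$ only for n = 1, so their VC dimension is 1. Intervals pick out n(n + 1)/2 + 1 subsets, which equals $2^n$ up to n = 2, so their VC dimension is 2.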



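Then we have

$$P^n_{X,Y}\left[ \sup_{\phi \in \mathcal{H}} |\hat{L}(\phi) - L(\phi)| > \epsilon \right] < 8 s(\mathcal{H}, n) e^{-n\epsilon^2/32},$$

which in turn implies

$$P^n_{X,Y}\left[ L(\hat{\phi}) - \min_{\phi \in \mathcal{H}} L(\phi) > \epsilon \right] < 8 s(\mathcal{H}, n) e^{-n\epsilon^2/128}.$$

Thus, given a sample of size $n = \frac{128}{\epsilon^2} \left( \ln s(\mathcal{H}, n) + \ln(8/\delta) \right)$ we have

$$P^n_{X,Y}\left[ L(\hat{\phi}) - \min_{\phi \in \mathcal{H}} L(\phi) > \epsilon \right] < \delta,$$
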

irrespective of the distribution Px,y ■ 
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4.2 Isolation Fusers 

We consider a family of fuser functions T : {/ : {0,1}'^ {0,1}} such that the 

fused output is given by f[(j>i{X),(j) 2 {X), . . . ,(j)N{X)], denoted by f{Z), where 
Z = {^i{X),^ 2 {X), . . . , (j)N{X)). The error probability of the fused system is 

Lpif) = J I{f{Z)^Y}dPx,Y- 

Note that Z is a deterministic function of X given the sample. For computational 
convenience, we utilize the following alternative formula 

Lpif) = I mZ) - YfdPxx- 

r)N 

Note that \P\ < 2^ since T consists of at most all Boolean functions on N 
variables. Consider the function class 



g = {/(<^i(X), , MX)) : </>! G Hi, (/.2 G ^2, . . . , (/-W G Hiv} . 

Here f{(pi{.),(j)2{-), ■ ■ ■ ,4>n{-)) specifies a subset of 3?^^, and hence Q specifies a 
family of sets of 3?*^. 

The fuser is obtained in two steps: (a) a training set $(Z_1, Y_1), (Z_2, Y_2), \ldots, (Z_n, Y_n)$,
where $Z_i = (\hat{\phi}_1(X_i), \hat{\phi}_2(X_i), \ldots, \hat{\phi}_N(X_i))$, is derived from the classifiers
and the original sample, and (b) the fuser is derived by minimizing empirical
error over $\mathcal{F}$. Let $f^*$ minimize $L_F(\cdot)$ over $\mathcal{F}$. Consider the empirical error

$$\hat{L}_F(f) = \frac{1}{n} \sum_{i=1}^{n} [f(Z_i) - Y_i]^2.$$

Let $\hat{f}$ minimize $\hat{L}_F(\cdot)$ over $\mathcal{F}$.
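The two-step construction above can be sketched in a few lines. This is a minimal illustration, assuming the classifier outputs are already collected as rows of 0/1 values and taking the fuser class to be all Boolean functions on N variables (feasible only for very small N); the function names are hypothetical.

```python
import itertools

import numpy as np


def empirical_error(f, Z, Y):
    """Empirical error (1/n) * sum_i [f(Z_i) - Y_i]^2 for 0/1 outputs."""
    preds = np.array([f(tuple(int(v) for v in z)) for z in Z])
    return float(np.mean((preds - np.asarray(Y)) ** 2))


def all_boolean_fusers(N):
    """Every Boolean function on N variables, each stored as a truth table."""
    inputs = list(itertools.product((0, 1), repeat=N))
    fusers = []
    for outputs in itertools.product((0, 1), repeat=len(inputs)):
        table = dict(zip(inputs, outputs))
        # Bind the table as a default argument so each fuser keeps its own copy.
        fusers.append(lambda z, table=table: table[z])
    return fusers


def fit_fuser(fuser_class, Z, Y):
    """Step (b): return the member of fuser_class with the smallest empirical error."""
    return min(fuser_class, key=lambda f: empirical_error(f, Z, Y))
```

With two classifiers whose outputs must be XOR-ed to recover the label, each classifier alone has empirical error 0.5, while the fitted fuser reaches 0 on the same sample.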

If one of the classifiers is to be chosen, the lowest achievable error is given
by $\min_{i=1}^{N} L(\phi_i^*)$. Since the classifiers can be correlated in an arbitrary manner, the
empirically best classifier $\hat{\phi}_{\min} = \arg\min_i \hat{L}(\hat{\phi}_i)$ yields the following guarantee

$$P^n_{X,Y}\left[L(\hat{\phi}_{\min}) - \min_{i=1}^{N} L(\phi_i^*) > \epsilon\right] \le \delta_1 + \delta_2 + \cdots + \delta_N.$$

The fuser, thus, provides a better guarantee if $\delta_F < \delta_1 + \delta_2 + \cdots + \delta_N$, where

$$P^n_{X,Y}\left[L_F(\hat{f}) - \min_{i=1}^{N} L(\phi_i^*) > \epsilon\right] \le \delta_F.$$



The fuser class $\mathcal{F}$ satisfies the isolation property [17] if it contains the
following $N$ functions: for all $i = 1, 2, \ldots, N$ we have $f_i(z_1, z_2, \ldots, z_N) = z_i$. This
property is trivially satisfied if $\mathcal{F}$ consists of all Boolean functions of $N$ variables.
Although it is sufficient to include $N$ functions in $\mathcal{F}$ to satisfy this property, in
general a richer class performs better in practice [11].

If the fuser class $\mathcal{F}$ satisfies the isolation property, then the fuser $\hat{f}$ provides a

N 2 

better guarantee than the best classifier under the condition |1F| < 1 ^ ^ 

Z— 1 

(see [11] for the proof). A minimal realization of this result can be based on
$\mathcal{F} = \{f_1, f_2, \ldots, f_N\}$ as per the isolation property. We wish to emphasize that
this fusion method can be easily applied without identifying the best classifier,
while still ensuring its performance in the fused system. The above condition
can also be expressed in terms of the VC dimensions as follows



N 



l-^l <4^ 

Z=1 



-£263w/128 



by noting that bi = i " e ^ > max(VHj, j [21]- 



5 Projective Fusers for Function Estimation 

The problem of function estimation based on empirical data arises in a number of 
disciplines such as statistics, systems theory, and computer science. As a result, 
there has been a profusion of function estimators, whose performance conditions 
could be quite involved and beyond the expertise of an average practitioner. 
Nevertheless, several of these estimators are based on considerable practical and 
theoretical insights, and it would be most desirable to retain their strengths. 

We are required to estimate a function $f : [0,1]^d \mapsto [0,1]$, based on a finite
sample $(X_1, f(X_1)), (X_2, f(X_2)), \ldots, (X_l, f(X_l))$ where $X_1, X_2, \ldots, X_l$, for $l < \infty$,
are iid according to an unknown distribution $P_X$ on $[0,1]^d$. For an estimator
$\hat{f}$ of $f$ we consider the expected square error given by

$$I(\hat{f}) = \int (f(X) - \hat{f}(X))^2 \, dP_X.$$

We are given $N$ previously computed function estimators (as in [14]), each
obtained by using an existing method. The individual estimator $\hat{f}_i$ could be a
potential function estimator, radial basis function, $k$-nearest neighbor estimator,
regressogram, kernel estimator, regression tree or another estimator.

Given the estimators $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_N$, we consider that the fuser is a function
$f_F : [0,1]^{d+N} \mapsto [0,1]$ such that $f_F(X, \hat{f}_1(X), \hat{f}_2(X), \ldots, \hat{f}_N(X))$ is the fused
estimate of $f(X)$. The expected and empirical errors of the fuser are respectively
given by

$$I(f_F) = \int [f(X) - f_F(X, \hat{f}_1(X), \hat{f}_2(X), \ldots, \hat{f}_N(X))]^2 \, dP_X$$

$$\hat{I}(f_F) = \frac{1}{l} \sum_{i=1}^{l} [f(X_i) - f_F(X_i, \hat{f}_1(X_i), \hat{f}_2(X_i), \ldots, \hat{f}_N(X_i))]^2.$$
Z=1 
5.1 Class of Projective Fusers

A projective fuser [15], $f_P$, corresponding to a partition $P = \{\pi_1, \pi_2, \ldots, \pi_k\}$,
$k \le N$, of the input space $[0,1]^d$ of $X$ ($\pi_i \subseteq [0,1]^d$, $\bigcup_{i=1}^{k} \pi_i = [0,1]^d$, and
$\pi_i \cap \pi_j = \emptyset$ for $i \ne j$), assigns to each block $\pi_i$ an estimator $\hat{f}_j$ such that

$$f_P(X, \hat{f}_1, \ldots, \hat{f}_N) = \hat{f}_j(X)$$

for all $X \in \pi_i$. For simplicity, we denote $f_P(X, \hat{f}_1, \ldots, \hat{f}_N)$ by $f_P(X)$. An
optimal projective fuser, denoted by $f_{P^*}$, minimizes $I(\cdot)$ over all projective fusers
corresponding to all partitions of $[0,1]^d$ and assignments of blocks to estimators.

We define the error curve of the estimator $\hat{f}$ for $f$ as $\mathcal{E}(X, \hat{f}) = (f(X) - \hat{f}(X))^2$.
The projective fuser based on the lower envelope of error curves is defined by

$$f_{LE}(X, \hat{f}_1, \ldots, \hat{f}_N) = \hat{f}_{i_{LE}(X)}(X)$$

where $i_{LE}(X) = \arg\min_{i=1,\ldots,N} \mathcal{E}(X, \hat{f}_i)$. In other words, $f_{LE}(X, \hat{f}_1, \ldots, \hat{f}_N)$
simply outputs the estimator with the lowest error at $X$. Thus, we have
$\mathcal{E}(X, f_{LE}) = \min_{i=1}^{N} \mathcal{E}(X, \hat{f}_i)$, or equivalently the error curve of $f_{LE}$ is the lower
envelope with respect to $X$ of the set of error curves $\{\mathcal{E}(X, \hat{f}_1), \ldots, \mathcal{E}(X, \hat{f}_N)\}$.
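The lower-envelope fuser can be sketched directly from its definition. The example below assumes a one-dimensional input and that the target function f is available, which holds only in a simulation; the function name and the vectorized estimator interface are assumptions for illustration.

```python
import numpy as np


def lower_envelope_fuser(f, estimators):
    """Build f_LE: at each x, output the estimator with the smallest squared error."""
    def f_le(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        outputs = np.stack([est(x) for est in estimators])  # shape (N, len(x))
        errors = (f(x) - outputs) ** 2                       # error curves E(x, f_i)
        best = np.argmin(errors, axis=0)                     # index i_LE(x) per point
        return outputs[best, np.arange(x.size)]
    return f_le
```

By construction the squared error of the returned values at each x equals the minimum of the individual error curves at that x.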

5.2 Nearest Neighbor Projective Fuser 

We partition the space of $X$ into Voronoi regions $V(X_1), V(X_2), \ldots, V(X_l)$ such
that

$$V(X_j) = \{X : \|X - X_j\| < \|X - X_k\| \text{ for all } k = 1, 2, \ldots, l;\ k \ne j\}$$

where $\|\cdot\|$ is the Euclidean metric. The points equidistant from more than one
sample point are arbitrarily assigned to one of the regions. We assume that all
$X_j$'s are distinct without loss of generality. $V(X_j)$ is simply the set of all
points that are at least as close to $X_j$ as to any other $X_k$. Let $NN(X) = k$ such
that $X \in V(X_k)$ for some $k$, which is the Voronoi cell that $X$ belongs to. For
the cell $V(X_{NN(X)})$ that contains $X$, we identify the estimator that achieves
the lowest empirical error at the sample point $X_{NN(X)}$ by defining the estimator
index of $X$ as follows

$$i_{NN}(X) = \arg\min_{i=1,2,\ldots,N} \left[f(X_{NN(X)}) - \hat{f}_i(X_{NN(X)})\right]^2.$$

That is, $i_{NN}(X)$ is the index of the estimator that achieves least empirical error
at the sample point $X_{NN(X)}$ nearest to $X$. Then the nearest neighbor projective
fuser [19] is defined as

$$f_{NN}(X, \hat{f}_1(X), \ldots, \hat{f}_N(X)) = \hat{f}_{i_{NN}(X)}(X).$$

Despite the notational complexity, the idea of $f_{NN}$ is quite simple: $f_{NN}(X)$ is
the $\hat{f}_i(X)$ that achieves least empirical error at the nearest sample point to $X$.
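The nearest neighbor projective fuser reduces to two lookups: find the nearest sample point, then use the estimator that was best at that point. The sketch below assumes each estimator is a callable taking an array of shape (n, d) and returning n values; the function name and this interface are hypothetical, and ties are broken by taking the first minimum.

```python
import numpy as np


def nearest_neighbor_fuser(X_train, y_train, estimators):
    """Build f_NN from a sample (X_i, f(X_i)) and a list of estimators."""
    y_train = np.asarray(y_train, dtype=float)
    X_train = np.asarray(X_train, dtype=float).reshape(len(y_train), -1)
    # Squared empirical error of each estimator at each sample point: shape (N, l).
    preds = np.stack([np.asarray(est(X_train), dtype=float).reshape(-1)
                      for est in estimators])
    best_at_sample = np.argmin((y_train - preds) ** 2, axis=0)

    def f_nn(X):
        X = np.asarray(X, dtype=float).reshape(-1, X_train.shape[1])
        # NN(X): index of the nearest sample point in the Euclidean metric.
        dists = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
        chosen = best_at_sample[np.argmin(dists, axis=1)]    # i_NN(X)
        outputs = np.stack([np.asarray(est(X), dtype=float).reshape(-1)
                            for est in estimators])
        return outputs[chosen, np.arange(X.shape[0])]
    return f_nn
```

The pairwise distance matrix keeps the sketch short; for large samples a spatial index would be the more practical choice.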
5.3 Sample-Based Projective Fusers

The computation of $f_{LE}$ in general requires a complete knowledge of the
distribution $P_X$. To address the case where such knowledge is not available, a method
was proposed in [15] that utilizes regression estimation methods to compute an
estimator $\hat{\mathcal{E}}(X, \hat{f}_i)$ of $\mathcal{E}(X, \hat{f}_i)$, and utilizes the lower envelope of these estimators
in the computation of the fuser. We now briefly outline the basic approach using the
cubic partitions with data-dependent offsets for $d = 1$. For a sequence $\{h_l\}$ of
positive numbers, consider the partition of $\Re$ given by $\mathcal{P}_l = \{[(r-1)h_l, r h_l) : r \in Z\}$.
Let $\mathcal{P}_l[X]$ denote the unique cell of $\mathcal{P}_l$ that contains $X$. Then, the estimator of
$\mathcal{E}(X, \hat{f}_i)$ is given by

$$\hat{\mathcal{E}}(X, \hat{f}_i) = \frac{\sum_{j=1}^{l} 1_{\{X_j \in \mathcal{P}_l[X]\}} \, \mathcal{E}(X_j, \hat{f}_i)}{\sum_{j=1}^{l} 1_{\{X_j \in \mathcal{P}_l[X]\}}}.$$

In other words, the estimator simply computes the mean of the error of $\hat{f}_i$ within
the cell of $\mathcal{P}_l$ that contains $X$. Consider the conditions: (i) $(f(X) - \hat{f}_i(X))^2 \le K$
for some $K > 0$; (ii) $\lim_{l \to \infty} h_l = 0$; and (iii) $l h_l \to \infty$ as $l \to \infty$. Then, we have
$\int |\hat{\mathcal{E}}(X, \hat{f}_i) - \mathcal{E}(X, \hat{f}_i)|^2 \, dP_X \to 0$ with probability 1 [8], regardless of the
distribution $P_X$. The fuser $\hat{f}_{LE}$ is computed using $\hat{\mathcal{E}}(X, \hat{f}_i)$ in place of $\mathcal{E}(X, \hat{f}_i)$.
The strong consistency of the $\hat{f}_{LE}$ method is shown under the boundedness of
$(f(X) - \hat{f}_i(X))^2$, namely $I(\hat{f}_{LE}) \to I(f_{LE})$ as $l \to \infty$ with probability 1 for any
distribution $P_X$ [15]. This result specifies the performance of $\hat{f}_{LE}$ for sample
sizes approaching infinity and does not tell much when sample sizes are finite.
The implementation of $\hat{f}_{LE}$ itself is tricky in that the choice of $h_l$ is not evident
if finite-sample performance is needed.
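The cell-mean estimate of an error curve can be sketched as follows for d = 1. This is an illustration only: it assumes cells of width h aligned at the origin (no data-dependent offset), returns 0 for cells that contain no sample point (a convention chosen here, not taken from the text), and uses hypothetical names.

```python
import numpy as np


def cell_mean_error(X_train, y_train, estimator, h):
    """Estimate the error curve of `estimator` by averaging squared errors per cell."""
    X_train = np.asarray(X_train, dtype=float)
    sq_err = (np.asarray(y_train, dtype=float) - estimator(X_train)) ** 2
    cells = np.floor(X_train / h).astype(int)        # cell index of each sample point

    def error_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        out = np.zeros(x.shape)                      # empty cells keep the value 0
        for j, c in enumerate(np.floor(x / h).astype(int)):
            in_cell = cells == c
            if in_cell.any():
                out[j] = sq_err[in_cell].mean()      # mean error within the cell of x
        return out
    return error_hat
```

A sample-based fuser would then pick, at each X, the estimator whose estimated error curve is lowest; the open question raised in the text, how to choose h for a finite sample, is left untouched by this sketch.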

The individual function estimators $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_N$ could be quite varied, but
several of them satisfy certain smoothness or non-smoothness conditions. For
any function $g : [-A, A]^d \mapsto \Re$, let $\|g(r)\|_\infty = \sup_{r \in [-A,A]^d} |g(r)|$. A function
$g(y) : [-A, A]^d \mapsto \Re$ is Lipschitz with constant $k_g$ if for all $y_1, y_2 \in [-A, A]^d$,
we have $\|g(y_1) - g(y_2)\|_\infty \le k_g \|y_1 - y_2\|_\infty$. The examples of smooth function
estimators include potential functions, sigmoid neural networks, smooth kernel
estimates, radial basis functions, linear and polynomial estimators.

Several function estimators are not Lipschitz, with popular examples including
nearest neighbor and Nadaraya-Watson estimators. To address such cases we
consider the class of functions with bounded variation, which allows for
discontinuities and includes Lipschitz functions as a subclass. Consider a one-dimensional
function $g : [-A, A] \mapsto \Re$. For $A < \infty$, a set of points $P = \{y_0, y_1, \ldots, y_n\}$ such
that $-A = y_0 < y_1 < \ldots < y_n = A$ is called a partition of $[-A, A]$. The
collection of all possible partitions of $[-A, A]$ is denoted by $\mathcal{P}[-A, A]$. A function
$g : [-A, A] \mapsto \Re$ is of bounded variation if there exists the total variation $M$ such
that for any partition $P = \{y_0, y_1, \ldots, y_n\}$, we have $\sum_{k=1}^{n} |g(y_k) - g(y_{k-1})| \le M$. A
multivariate function $g : [-A, A]^d \mapsto \Re$ is of bounded variation if it is so in each
of its input variables for every value of the other input variables. The following are
useful facts about the functions of bounded variation: (i) not all continuous
functions are of bounded variation, e.g. $g(y) = y \cos(\pi/(2y))$ for $y \ne 0$ and $g(0) = 0$;
(ii) functions with bounded derivatives on compact domains are of bounded variation; and
(iii) absolutely continuous functions, which include Lipschitz functions, are of
bounded variation.
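Fact (i) can be checked numerically. The sketch below evaluates the variation sum of $g(y) = y\cos(\pi/(2y))$ over the partitions $0 < 1/(2n) < 1/(2n-1) < \ldots < 1/2 < 1$ of $[0, 1]$; the restriction to $[0, 1]$ and the helper names are choices made for this illustration.

```python
import math


def g(y):
    """The continuous function y*cos(pi/(2y)), extended by g(0) = 0."""
    return 0.0 if y == 0 else y * math.cos(math.pi / (2 * y))


def variation(points):
    """Sum of |g(y_k) - g(y_{k-1})| over consecutive points of a partition."""
    return sum(abs(g(b) - g(a)) for a, b in zip(points, points[1:]))


def partition(n):
    """Partition 0 < 1/(2n) < 1/(2n-1) < ... < 1/2 < 1 of the interval [0, 1]."""
    return [0.0] + [1.0 / m for m in range(2 * n, 0, -1)]
```

On these partitions g alternates between 0 (at odd reciprocals) and $\pm 1/(2k)$ (at even reciprocals), so the variation sum equals the harmonic number $1 + 1/2 + \ldots + 1/n$, which grows without bound; no finite $M$ can therefore serve as a total variation.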

The function estimators such as $k$-nearest neighbor, Haar wavelet estimators,
regression tree, regressogram and Nadaraya-Watson estimator (which all could
involve discrete jumps) satisfy the bounded variation property. Since Lipschitz
estimators over compact domains also have bounded variation, the latter is a
fairly general property satisfied by most of the widely-used estimators.

We consider that the function estimators $\hat{f}_1, \ldots, \hat{f}_N$ are of bounded variation.
Let each function estimator $\hat{f}_i$ be of total variation $V_i$. For $V = \sum_{i=1}^{N} V_i$, it is shown
in [19] that $P\left[I(f_{NN}) - I(f_{LE}) > \epsilon\right] < \delta$ for sample size



256 



18 




128 F ^ 



ln^(128/e)-f ln(16/(j) . 



Furthermore, $I(f_{NN}) \to I(f_{LE})$ as $l \to \infty$. This result establishes the
analytical viability of $f_{NN}$ for finite samples. While the sample size estimate is not
necessarily within practical limits, the overall result itself is stronger than the
asymptotic consistency.



5.4 Computational Example 

We consider the problem of estimating 

$$f(X) = 0.02(12 + 3X + 7.2X^2)(1.0 + \cos(4\pi X))(1.0 + 0.8\sin(3\pi X/7))$$

based on a sample. Two samples each of size 200 (Fig. 1(a)) are used in training
the neural networks and fuser. Five feedforward neural networks are trained
using the backpropagation algorithm with different starting weights and different
learning rates as shown in Fig. 1(b). The performance of the estimators and fuser
is measured by the empirical error on the sample. Estimator 1 approximates $f$
well only in the vicinity of $X = 1$, whereas estimator 2 is close to the function in
the vicinity of $X = 0$. Estimator 3 provides a good approximation at both ends
of the interval [0, 1] and is the best of the estimators. However, this estimator is
insensitive to the variations of $f(X)$ in the middle of the interval [0, 1]. Estimator
4 performs the worst, staying close to 0 over the entire interval [0, 1]. The performance
of $f_{NN}$ is shown in Fig. 1(c); it is uniformly as good as any of the estimators
across the entire interval. The best estimator, estimator 3, is used by the fuser for
most of the interval [0, 1] except in the middle. It is interesting to note that the worst
estimator, namely estimator 4, is used in the lowest portions of $f(X)$, and indeed
is responsible for the better performance achieved by the fuser.

Fig. 1. Nearest neighbor projective fuser for function estimators.



6 Conclusions 

A generic sensor fusion problem is formulated for sensors whose measurements 
are subject to unknown probability distributions. A brief overview of fuser design 
methods is presented with a focus on finite sample performance guarantees. The 
classes of isolation and projective fusers are described for the special cases of 
classification and function estimation. Similar concepts have been studied in 
multiple classifier systems [3, 23]. The methods described in this paper have been 
applied in practice for combining ultrasonic and infrared sensor measurements 
for robot navigation, prediction of embrittlement levels in light water reactors, 
combining sensor readings of well data in methane hydrate explorations, and 
combining radar measurements for target detection. 

Several open problems remain in the generic sensor fusion problem as well as 
in classification and function estimation. Often the sample bounds are too large 
to be practical, and the performance equations do not provide uniform precision 
in that the sensor with best bound is not necessarily the best. It would be in- 
teresting to develop principles that bridge the gap between performance bounds 
and actual performance. Also there has been a profusion of fusion concepts of 
significant diversity, and it would be interesting to identify unifying principles 
behind these developments. 
Abstract. AdaBoost [4] is a well-known ensemble learning algorithm
that constructs its base models in sequence. AdaBoost constructs a dis- 
tribution over the training examples to create each base model. This dis- 
tribution, represented as a vector, is constructed with the goal of making 
the next base model’s mistakes uncorrelated with those of the previous 
base model [5]. We previously [7] developed an algorithm, AveBoost, 
that first constructed a distribution the same way as AdaBoost but then 
averaged it with the previous models’ distributions to create the next 
base model’s distribution. Our experiments demonstrated the superior 
accuracy of this approach. In this paper, we slightly revise our algo- 
rithm to obtain non-trivial theoretical results: bounds on the training 
error and generalization error (difference between training and test er- 
ror). Our averaging process has a regularizing effect which leads us to a 
worse training error bound for our algorithm than for AdaBoost but a 
better generalization error bound. This leads us to suspect that our new 
algorithm works better than AdaBoost on noisy data. For this paper, we 
experimented with the data that we used in [7] both as originally sup- 
plied and with added label noise - some of the data has its original label 
changed randomly. Our algorithm’s experimental performance improve- 
ment over AdaBoost is even greater on the noisy data than the original 
data. 



1 Introduction 

AdaBoost [4] is one of the most well-known and highest-performing ensemble 
classifier learning algorithms [3]. It constructs a sequence of base models, where 
each model is constructed based on the performance of the previous model on 
the training set. In particular, AdaBoost calls the base model learning algorithm 
with a training set weighted by a distribution¹. After the base model is created,
it is tested on the training set to see how well it learned. We assume that the
base model learning algorithm is a weak learning algorithm; that is, with high
probability, it produces a model whose probability of misclassifying an example
is less than 0.5 when that example is drawn from the same distribution that
generated the training set. The point is that such a model performs better than
random guessing². The weights of the correctly classified examples and misclassified
examples are scaled down and up, respectively, so that the two groups' total
weights are 0.5 each. The next base model is generated by calling the learning
algorithm with this new weight distribution and the training set. The idea is
that, because of the weak learning assumption, at least some of the previously
misclassified examples will be correctly classified by the new base model. Previously
misclassified examples are more likely to be classified correctly because of
their higher weights, which focus more attention on them. AdaBoost scales the
distribution with the goal of making the next base model's mistakes uncorrelated
with those of the previous base model [5].

¹ If the base model learning algorithm cannot take a weighted training set as input,
then one can create a sample with replacement from the original training set
according to the distribution and call the algorithm with that sample.

AdaBoost is notorious for performing poorly on noisy datasets [3], such as 
those with label noise - that is, some examples were randomly assigned the 
wrong class label. Because these examples are inconsistent with the majority 
of examples, they tend to be harder for the base model learning algorithm to 
learn. AdaBoost increases the weights of examples that the base model learning 
algorithm did not learn correctly. Noisy examples are likely to be incorrectly 
learned by many of the base models so that eventually these examples’ weights 
will dominate those of the remaining examples. This causes AdaBoost to focus 
too much on the noisy examples at the expense of the majority of the training 
examples, leading to poor performance on new examples. 

We previously [7] presented an algorithm, called AveBoost, which calculates 
the next base model’s distribution by first calculating a distribution the same 
way as in AdaBoost, but then averaging it elementwise with those calculated for 
the previous base models. This averaging mitigates AdaBoost’s tendency to in- 
crease the weights of noisy examples to excess. In our previous work we presented 
promising experimental results. However, we did not present theoretical results. 
In our subsequent research, we were unable to derive a non-trivial training error 
bound for the algorithm presented in [7]. In this paper, we present a slight mod- 
ification to AveBoost which allows us to obtain both a non-trivial training error 
bound and a generalization error bound (difference between training error and 
test error). We call this algorithm AveBoost2. In Section 2, we review AdaBoost. 
In Section 3, we describe the AveBoost2 algorithm and state how it is different 
from AveBoost. In Section 4, we present our training error bound and gener- 
alization error bound. The averaging in our algorithm has a regularizing effect; 
therefore, as expected, our training error bound is worse than that of AdaBoost 
but our generalization error bound is better than AdaBoost’s. In Section 5, we 
present an experimental comparison of our new AveBoost2 with AdaBoost on 
some UCI datasets [1] both in original form and with 10% label noise added. 
Section 6 summarizes this paper and describes ongoing and future work. 



² The version of AdaBoost that we use was designed for two-class classification
problems. However, it is often used for a larger number of classes when the base model
learning algorithm is strong enough to have an error less than 0.5 in spite of the
larger number of classes.
AdaBoost($\{(x_1, y_1), \ldots, (x_m, y_m)\}, L_b, T$)
    Initialize $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$.
    For $t = 1, 2, \ldots, T$:
        $h_t = L_b(\{(x_1, y_1), \ldots, (x_m, y_m)\}, d_t)$
        Calculate the error of $h_t$: $\epsilon_t = \sum_{i : h_t(x_i) \ne y_i} d_{t,i}$.
        If $\epsilon_t \ge 1/2$ then
            set $T = t - 1$ and abort this loop.
        $\beta_t = \epsilon_t / (1 - \epsilon_t)$
        Calculate distribution $d_{t+1}$:
        For $i = 1, 2, \ldots, m$:
            $w_i = d_{t,i} \times \beta_t$ if $h_t(x_i) = y_i$, and $w_i = d_{t,i}$ otherwise
            $d_{t+1,i} = w_i / \sum_{j=1}^{m} w_j$
    Output the final hypothesis:
        $h_{fin}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \log(1/\beta_t)$

Fig. 1. AdaBoost algorithm: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training set, $L_b$ is the base
model learning algorithm, and $T$ is the maximum allowed number of base models.



2 AdaBoost 

Figure 1 shows AdaBoost's pseudocode. AdaBoost constructs a sequence of base
models $h_t$ for $t \in \{1, 2, \ldots, T\}$, where each model is constructed based on the
performance of the previous base model on the training set. In particular,
AdaBoost maintains a distribution over the $m$ training examples. The distribution
$d_1$ used in creating the first base model gives equal weight to each example
($d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$). AdaBoost now enters the loop, where the
base model learning algorithm $L_b$ is called with the training set and $d_1$. The
returned model $h_1$ is then tested on the training set to see how well it learned.
The total weight of the misclassified examples ($\epsilon_1$) is calculated. The weights
of the correctly-classified examples are multiplied by $\epsilon_1/(1 - \epsilon_1)$ so that they
have the same total weight as the misclassified examples. The weights of all the
examples are then normalized so that they sum to 1 instead of $2\epsilon_1$. AdaBoost
assumes that $L_b$ is a weak learner, i.e., $\epsilon_t < 1/2$ with high probability. Under this
assumption, the total weight of the misclassified examples $\epsilon_t < 1/2$ is increased
to 1/2 and the total weight of the correctly classified examples $1 - \epsilon_t > 1/2$ is
decreased to 1/2. This is done so that, by the weak learning assumption, the next
model $h_{t+1}$ will classify at least some of the previously misclassified examples
correctly. Returning to the algorithm, the loop continues, creating the $T$ base
models in the ensemble. The final ensemble returns, for a new example, the one
class in the set of classes $Y$ that gets the highest weighted vote from the base
models. Each base model's vote is $\log((1 - \epsilon_t)/\epsilon_t)$, which grows with its accuracy
on the weighted training set used to train it.
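The reweighting step described above can be sketched as a single function. This is a minimal illustration that assumes the base model's per-example correctness is already available as a boolean array; the function name and return convention are hypothetical.

```python
import numpy as np


def adaboost_update(d, correct):
    """One AdaBoost reweighting step.

    d: current distribution over the m training examples.
    correct: boolean array, True where the base model classified example i correctly.
    Returns (next distribution, weighted error eps, beta).
    """
    d = np.asarray(d, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    eps = d[~correct].sum()                  # total weight of misclassified examples
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch assumes 0 < eps < 1/2")
    beta = eps / (1.0 - eps)
    w = np.where(correct, d * beta, d)       # scale down correctly classified examples
    return w / w.sum(), eps, beta            # w sums to 2*eps before normalizing
```

After the update, the misclassified and the correctly classified examples each carry total weight 1/2, which is the balancing property the text relies on.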
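AveBoost2($\{(x_1, y_1), \ldots, (x_m, y_m)\}, L_b, T$)
    Initialize $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$.
    For $t = 1, 2, \ldots, T$:
        $h_t = L_b(\{(x_1, y_1), \ldots, (x_m, y_m)\}, d_t)$
        Calculate the error of $h_t$: $\epsilon_t = \sum_{i : h_t(x_i) \ne y_i} d_{t,i}$.
        If $\epsilon_t \ge 1/2$ then
            set $T = t - 1$ and abort this loop.
        $\beta_t = \epsilon_t / (1 - \epsilon_t)$
        $\gamma_t = \frac{2(1 - \epsilon_t)t + 1}{2\epsilon_t t + 1}$
        Calculate distribution $d_{t+1}$:
        For $i = 1, 2, \ldots, m$:
            $w_i = d_{t,i} \times \beta_t$ if $h_t(x_i) = y_i$, and $w_i = d_{t,i}$ otherwise
            $c_{t,i} = w_i / \sum_{j=1}^{m} w_j$
            $d_{t+1,i} = \frac{t \, d_{t,i} + c_{t,i}}{t + 1}$
    Output the final hypothesis:
        $h_{fin}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \log(1/(\beta_t \gamma_t))$

Fig. 2. AveBoost2 algorithm: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training set, $L_b$ is the base
model learning algorithm, and $T$ is the maximum allowed number of base models.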



3 AveBoost2 Algorithm 

Figure 2 shows our new algorithm, AveBoost2. Just as in AdaBoost, AveBoost2
initializes $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$. Then it goes inside the loop,
where it calls the base model learning algorithm $L_b$ with the training set and
distribution $d_1$ and calculates the error $\epsilon_1$ of the resulting base model $h_1$. It then
calculates $c_1$, which is the distribution that AdaBoost would use to construct the
next base model. However, AveBoost2 averages this with $d_1$ to get $d_2$, and uses
this $d_2$ instead. Showing that the $d_t$'s in AveBoost2 are distributions is a trivial
proof by induction. For the base case, $d_1$ is constructed to be a distribution. For
the inductive part, if $d_t$ is a distribution, then $d_{t+1}$ is a distribution because it is
a convex combination of $d_t$ and $c_t$, both of which are distributions. The vector
$d_{t+1}$ is a running average of $d_1$ and the vectors $c_q$ for $q \in \{1, 2, \ldots, t\}$.

Returning to the algorithm, the loop continues for a total of $T$ iterations.
Then the base models are combined using a weighted voting scheme slightly
different from that of AdaBoost and the original AveBoost from [7]: each model's
weight is $\log(1/(\beta_t \gamma_t))$ instead of $\log(1/\beta_t)$, where
$\gamma_t = (2(1 - \epsilon_t)t + 1)/(2\epsilon_t t + 1)$ as in Figure 2. AveBoost2 is actually AdaBoost
with $\beta_t$ replaced by $\beta_t \gamma_t$. However, we wrote the AveBoost2 pseudocode as we
did to make the running average calculation of the distribution explicit.
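The running-average update, and the claim that it amounts to AdaBoost with a modified scaling factor, can be checked with a short sketch. The function names are hypothetical, and the second function simply evaluates the product of $\epsilon_t/(1-\epsilon_t)$ and the factor $(2(1-\epsilon_t)t+1)/(2\epsilon_t t+1)$ that appears in Figure 2.

```python
import numpy as np


def aveboost2_update(d, correct, t):
    """One AveBoost2 step at iteration t (1-based): average d_t with AdaBoost's c_t."""
    d = np.asarray(d, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    eps = d[~correct].sum()                  # weighted error of the base model
    beta = eps / (1.0 - eps)
    w = np.where(correct, d * beta, d)
    c = w / w.sum()                          # distribution AdaBoost would use next
    return (t * d + c) / (t + 1)             # running average of d_t and c_t


def beta_gamma(eps, t):
    """Effective scaling of correct relative to misclassified examples."""
    beta = eps / (1.0 - eps)
    gamma = (2.0 * (1.0 - eps) * t + 1.0) / (2.0 * eps * t + 1.0)
    return beta * gamma
```

Because the factor multiplying beta exceeds 1 whenever the error is below 1/2, the effective scaling is closer to 1 than AdaBoost's, so weights move more slowly from one iteration to the next.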

AveBoost2 can be seen as a relaxed version of AdaBoost. When training
examples are noisy and therefore difficult to fit, AdaBoost is known to increase the
weights of those examples to excess and overfit them [3] because many
consecutive base models may not learn them properly. AveBoost2's averaging does not
allow the weights of noisy examples to increase rapidly, thereby mitigating the 
overfitting problem. We therefore expect AveBoost2 to outperform AdaBoost 
on the noisy datasets to a greater extent than on the original datasets. We also 
expect AveBoost2’s advantage to be greater for smaller numbers of base models. 
When AveBoost2 creates a large ensemble, later training set distributions (and 
therefore later base models) cannot be too different from each other because 
they are prepared by averaging over many previous distributions. Therefore, we 
expect that later models will not have as much of an impact on performance. 



4 Theory 



In this section, we give bounds on the training error and generalization error 
(difference between training and test error). Not surprisingly, the relaxed nature
of AveBoost2 relative to AdaBoost caused us to obtain a worse training error 
bound but superior generalization error bound for AveBoost2 relative to Ad- 
aBoost. Due to space limitations, we defer the proofs and more intuition on the 
theoretical frameworks that we use to a longer version of this paper. AveBoost2’s 
training error bound is stated in the following theorem. 

Theorem 1. In AveBoost2, suppose the weak learning algorithm $L_b$ generates
hypotheses with errors $\epsilon_1, \epsilon_2, \ldots, \epsilon_T$ where each $\epsilon_t < 1/2$. Then the ensemble's
error $\epsilon$ is bounded as follows:






t -i- 1 




t 

2£t(l-£t) 



1 

4et(l — £t) 



This bound is non-trivial ($\epsilon < 1$), but greater than that of AdaBoost [4]:

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$



To derive our generalization error bound, we use the algorithmic stability 
framework of [6]. Intuitively, algorithmic stability is similar to Breiman’s notion
of stability [2] - the more stable a learning algorithm is, the less of an effect 
changes to the training set have on the model returned. The more stable the 
learning algorithm is, the smaller the difference between the training and test 
errors tends to be, assuming that the training and test sets are drawn from 
the same distribution. We show that AveBoost2 is more stable than AdaBoost; 
therefore, the difference between the training and test errors is lower. We first 
give some preliminaries from [6] and then state our new result. 

For the following, $\mathcal{X}$ is the space of possible inputs, $\mathcal{Y} = \{0, 1\}$ is the set of
possible labels, and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$.
Definition 1 (Definition 2.5 from [6]). A learning algorithm is a process
which takes as input a distribution $p$ on $\mathcal{Z}$ with finite support and outputs a
function $f_p : \mathcal{X} \to [0, 1]$. For $S \in \mathcal{Z}^m$ for some positive integer $m$, $f_S$ means $f_p$
where $p$ is the uniform distribution on $S$.

In the following, the error of $f$ on an example $(x, y)$ is $c(f, (x, y)) = |f(x) - y|$.

Definition 2 (Definition 2.11 from [6]). A learning algorithm has $L_1$-stability
$\lambda$ if, for any two distributions $p$ and $q$ on $\mathcal{Z}$ with finite support,

$$\forall z \in \mathcal{Z}, \quad |c(f_p, z) - c(f_q, z)| \le \lambda \|p - q\|_1.$$

In the following, $D$ is a distribution on $\mathcal{Z}$, $S \sim D^m$ is a set of $m$ examples
drawn from $\mathcal{Z}$ according to $D$, and $S^{i,u}$ is $S$ with example $i \in \{1, 2, \ldots, m\}$
removed (each $i$ is chosen with probability $1/m$) and example $u \sim D$ added.

Definition 3 (Definition 2.14 from [6]). A learning algorithm is $(\beta, \delta)$-stable
if

$$P_{S \sim D^m}\left(|c(f_S, z) - c(f_{S^{i,u}}, z)| \le \beta\right) \ge 1 - \delta.$$

Intuitively, $f_S$ and $f_{S^{i,u}}$ are models that result from running the learning
algorithm on two slightly different training sets. As $\beta$ and $\delta$ decrease, the
probability of having smaller differences in errors between these two models increases,
which means that the learning algorithm is more stable. Greater stability implies
lower generalization error according to the following theorem. In the following,
$Err_S(f_S)$ is the training error (error on the training set $S$) and $Err_D(f_S)$ is the
test error, i.e., the error on an example $(x, y)$ chosen at random according to
distribution $D$.

Theorem 2 (Theorem 3.4 from [6]). Suppose a $(\beta, \delta)$-stable learning
algorithm returns a hypothesis $f_S$ for any training set $S \sim D^m$. Then for all $\tau > 0$
and $m \ge 1$,

$P_{S \sim D^m}\left(|Err_S(f_S) - Err_D(f_S)| > \tau + \beta + \delta\right)$

( —r'^m \ 3wf6 

- ®""^V2(2m/3+l)2j + 2m(3+l 

Intuitively, this theorem shows that lower values of $\beta$ and $\delta$ lead to lower
probabilities of large differences between the training and test errors. We can
finally state our theorem on AveBoost2's stability.

Theorem 3. Suppose the base model learning algorithm has $L_1$-stability $\lambda$ and
$\min_{t \in \{1, 2, \ldots, T\}} \epsilon_t \ge \epsilon_* > 0$. Then, for sufficiently large $m$ and for all $T$,
AveBoost2 is $(\beta, \delta)$-stable, where
Table 1. The datasets used in the experiments.

Data Set          Training Set   Test Set   Inputs   Classes
Promoters                   84         22       57         2
Balance                    500        125        4         3
Breast Cancer              559        140        9         2
German Credit              800        200       20         2
Car Evaluation            1382        346        6         4
Chess                     2556        640       36         2
Mushroom                  6499       1625       22         2
Nursery                  10368       2592        8         5
Connect4                 54045      13512       42         3

Table 2. Performance of AveBoost2 compared to AdaBoost.

                   ORIGINAL (Num. Base Models)    10% NOISE (Num. Base Models)
Base Model         10        50        100        10        50        100
Naive Bayes        +4=4-1    +4=4-1    +4=4-1     +8=1-0    +8=1-0    +7=2-0
Decision Trees     +2=6-1    +1=6-2    +2=6-1     +6=2-1    +5=3-1    +6=2-1
Decision Stumps    +1=6-2    +1=6-2    +1=6-2     +0=7-2    +1=7-1    +1=7-1


For AdaBoost, the theorem is the same except that [6] 

, 2 ^2‘^+1(A+1)‘ 

m e: 

t=i * 

which is larger than the corresponding $\beta$ for AveBoost2. By Theorem 2, this means
that AveBoost2's generalization error bound is smaller than that of AdaBoost.

Note that as $\epsilon_*$ decreases toward zero, the $\beta$ and $\delta$ for both AdaBoost and
AveBoost2 increase, which means both algorithms are less stable. This makes
sense because, as $\epsilon_*$ decreases, the upper bounds on the weights assigned to each
base model (the weights being $\log(1/\beta_t)$ for AdaBoost and $\log(1/(\beta_t \gamma_t))$ for
AveBoost2) increase. This means that each individual base
model's potential influence on the ensemble's result is higher, so that changes in
the individual base models lead to greater changes in the ensemble's predictions.

5 Experimental Results 

In this section, we compare AdaBoost and AveBoost2 on the nine UCI datasets 
[1] described in Table 1. We ran both algorithms with three different values of 
T, which is the maximum number of base models that the algorithm is allowed 
to construct: 10, 50, and 100. Each result reported is the average over 50 results 
obtained by performing 10 runs of 5-fold cross-validation. Table 1 shows the 
sizes of the training and test sets for the cross-validation runs. We also repeated 
these runs after adding 10% label noise. That is, we randomly chose 10% of the
examples in each dataset and changed their labels to one of the remaining labels
with equal probability.

Fig. 3. Test set error rates of AdaBoost vs. AveBoost2 (Naive Bayes, original datasets).

Fig. 4. Test set error rates of AdaBoost vs. AveBoost2 (Naive Bayes, noisy datasets).
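The label-noise procedure described above can be sketched as follows. This is an illustration under assumptions: labels are held in an array, the set of classes is passed explicitly, and the function name, the `fraction` parameter and the seeding convention are hypothetical rather than taken from the experiments.

```python
import numpy as np


def add_label_noise(labels, classes, fraction=0.10, rng=None):
    """Relabel a random `fraction` of examples with a different class chosen uniformly."""
    rng = np.random.default_rng(rng)
    labels = np.array(labels)                # copy so the caller's data is untouched
    n_noisy = int(round(fraction * len(labels)))
    noisy_idx = rng.choice(len(labels), size=n_noisy, replace=False)
    for i in noisy_idx:
        # Choose uniformly among the classes other than the current label.
        others = [c for c in classes if c != labels[i]]
        labels[i] = rng.choice(others)
    return labels, noisy_idx
```

Since every selected example receives a label different from its original one, exactly the chosen fraction of labels is wrong after the call.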

Table 2 shows how often AveBoost2 significantly outperformed, performed
comparably with, and significantly underperformed AdaBoost. For example, on
the original datasets and with 10 Naive Bayes base models, AveBoost2
significantly outperformed³ AdaBoost on four datasets, performed comparably on four
datasets, and performed significantly worse on one, which is written as “+4=4-1.”
Figures 3 and 4 compare the error rates of AdaBoost and AveBoost2 with
Naive Bayes base models on the original and noisy datasets, respectively. In all
the plots presented in this paper, each point marks the error rates of two
algorithms when run with the number of base models indicated in the legend and a
particular dataset. The diagonal lines in the plots contain points at which the two
algorithms have equal error. Therefore, points below/above the line correspond
to the error of the algorithm indicated on the y-axis being less than/greater than
the error of the algorithm indicated on the x-axis, respectively. For Naive Bayes
base models, AveBoost2 performs much better than AdaBoost overall, especially
on the noisy datasets.

We compare AdaBoost and AveBoost2 using decision tree base models in
Figure 5 (original datasets) and Figure 6 (noisy datasets). On the original datasets,
the performances of the two algorithms are comparable. However, on the noisy
datasets, AveBoost2 is superior for all except the Balance dataset. On the
Balance dataset, AdaBoost actually performed as much as 10% better on the noisy
data than on the original data, which is unexpected and needs to be investigated
further. On the other hand, AveBoost2 performed worse on the noisy Balance data
than on the original Balance data. Figure 7 gives the error rate comparison
between AdaBoost and AveBoost2 with decision stump base models on the original
datasets. Figure 8 gives the same comparison on the noisy datasets. With
decision stumps, the two algorithms always seem to perform comparably. We suspect
that decision stumps are too stable to allow the different distribution calculation
methods of AdaBoost and AveBoost2 to yield significant differences. In all the
results, we did not see the hypothesized differences in performance as a function
of the number of base models.

* We use a t-test with α = 0.05 to compare all the classifiers in this paper.

Fig. 5. Test set error rates of AdaBoost vs. AveBoost2 (Decision Trees, original
datasets).

Fig. 6. Test set error rates of AdaBoost vs. AveBoost2 (Decision Trees, noisy
datasets).

Fig. 7. Test set error rates of AdaBoost vs. AveBoost2 (Decision Stumps, original
datasets).

Fig. 8. Test set error rates of AdaBoost vs. AveBoost2 (Decision Stumps, noisy
datasets).

6 Conclusions 

We presented AveBoost2, a boosting algorithm that trains each base model
using a training example weight vector that is based on the performances of all
the previous base models rather than just the previous one. We discussed our
theoretical results and demonstrated empirical results that are superior overall
to AdaBoost, especially on datasets with label noise.

Our theoretical and empirical results do not account for what happens as
the amount of noise in the data changes. We plan to derive such results. In
a longer version of this paper, we plan to perform a more detailed empirical
analysis including the performances of the base models and ensembles on the
training and test sets, correlations among the base models, ranges of the weights
of regular and noisy examples, etc. In [7], we performed such an analysis to a
limited extent for the original AveBoost and were able to confirm some of what
we hypothesized there and in this paper: the base model accuracies tend to be
higher than for AdaBoost, the correlations among the base models also tend to
be higher, and the ranges of the weights of the training examples tend to be
lower. We were unable to repeat this analysis here due to a lack of space. We
also plan to compare our algorithm to other efforts to make boosting work better
with noisy data, such as identifying examples that have consistently high weight
as noisy and, therefore, untrustworthy [8].
Abstract. Ensemble methods improve accuracy by combining the predictions
of a set of different hypotheses. A well-known method for generating
hypothesis ensembles is Bagging. One of the main drawbacks
of ensemble methods in general, and Bagging in particular, is the huge
amount of computational resources required to learn, store, and apply
the set of models. Another problem is that, even using the bootstrap
technique, many simple models are similar, which limits the ensemble
diversity. In this work, we investigate an optimisation technique based on
sharing the common parts of the models in an ensemble formed by
decision trees in order to minimise both problems. Concretely, we employ
a structure called decision multi-tree, which can simultaneously contain a
set of decision trees and hence consider the “repeated” parts just once. A
thorough experimental evaluation is included to show that the proposed
optimisation technique pays off in practice.

Keywords: Ensemble Methods, Decision Trees, Bagging. 



1 Introduction 

With the goal of improving model accuracy, there has been increasing interest
in defining methods that combine hypotheses. These methods construct a set of
hypotheses (an ensemble) and then combine the components of the ensemble in
some way (typically by weighted or unweighted voting) in order to classify
examples. The accuracy obtained is often better than that of the individual
components of the ensemble. This technique is known as Ensemble Methods [3].

The accuracy improvement of ensemble methods can be intuitively justified
because the combined model represents an increase in expressiveness over the
single components of the ensemble, and because the combination of uncorrelated
errors avoids overfitting. The quality of the generated ensemble depends highly
on the accuracy and diversity of its individual components [9].

Many methods have been proposed to construct a set of classifiers from a
single source of evidence. These techniques have been applied to many different
learning algorithms. Dietterich [3] distinguishes different kinds of ensemble
construction methods, of which those based on manipulating the training
examples are probably the most frequently used. The common idea of these methods
boils down to running the same learning algorithm several times, each time with a
different subset or weighting of the training examples, thus generating a different
classifier for each set. The relevant issue in these methods is to define a good
mechanism for generating subsets from the set of training examples. For instance,
Bagging [2, 16], Boosting [8, 16] and Cross-validated committees [13] are ensemble
methods of this family.

Bagging is derived from the technique known as bootstrap aggregation. This 
method constructs subsets by generating a sample of m training examples, se- 
lected randomly (and with replacement) from the original training set of m 
instances. The new subsets of training examples are called bootstrap replicates. 
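A bootstrap replicate as described above can be drawn in a few lines. This is a minimal sketch; the function name `bootstrap_replicate` and the optional `seed` parameter are assumptions made for illustration.

```python
import random


def bootstrap_replicate(examples, seed=None):
    """Draw m examples uniformly at random, with replacement, from a
    training set of m examples."""
    rng = random.Random(seed)
    m = len(examples)
    return [examples[rng.randrange(m)] for _ in range(m)]


training_set = list(range(1000))
replicate = bootstrap_replicate(training_set, seed=42)
distinct_fraction = len(set(replicate)) / len(training_set)
```

Because sampling is with replacement, some examples appear several times and others not at all; for large m the expected fraction of distinct original examples in a replicate approaches 1 - 1/e, roughly 63%.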

Several works have compared the performance of ensemble
methods. Dietterich compared the Bagging, Boosting and Randomisation
ensemble methods experimentally in [4]. The conclusions are that, in problems
without noise, Boosting gets the best results, while the results of Bagging and
Randomisation are quite similar. With respect to noisy datasets, Bagging is the
best method, followed by Randomisation and, finally, Boosting.

Ensemble methods also have important drawbacks. Probably the most
important problem is that they require the generation, storage, and application of a
set of models in order to predict future cases. This represents a considerable
consumption of resources in both scenarios: the learning process and the
prediction of new cases. These hindrances frequently limit the application of
ensemble methods to real problems.

Let us consider the following scenario. Given the classical playtennis
problem [11], we construct an ensemble of four decision trees by applying Bagging
over the C4.5 decision-tree learning algorithm, i.e., we learn four decision trees
with C4.5 from four different bootstrap replicates. The four trees learned are
shown in Figure 1.

If we observe the four decision trees, we can appreciate that there are many
similarities among them. For instance, Decision Tree 1 and Decision Tree 2, as
well as Decision Tree 3 and Decision Tree 4, have the same condition at the root.
More concretely, Decision Tree 1 and Decision Tree 2 are almost identical: the
only difference between the two trees is that Decision Tree 1 has an additional
split in a node considered as a leaf in Decision Tree 2.

The first consequence of this phenomenon is that most of the
solutions are similar and, hence, their errors can be correlated. Moreover, these
patterns and regularities are not exploited when learning ensembles of decision
trees, so the process is also usually expensive in terms of computational
cost. Motivated by these two problems, we present in this work an algorithm that
exploits these regularities and therefore allows a better use of resources
when learning ensembles of decision trees. We employ a structure called decision
multi-tree that can simultaneously contain a set of decision trees sharing their
common parts. In previous works [6], we developed the idea of options trees [10]
into the multi-tree structure, using a beam-search method to populate the
multi-tree, mostly based on the randomisation technique, and several fusion
techniques to merge the solutions in the multi-tree. One of the most delicate
aspects of our previous approach was the choice of alternate splits. The use of
Bagging in multi-trees solves this problem and, furthermore, allows a fairer
comparison between the multi-tree and other ensemble methods. Although the
presented algorithm is based on Bagging, the same idea could easily be applied to
other ensemble methods such as Boosting.

Fig. 1. Four different decision trees (Decision Tree 1 to Decision Tree 4) for the
playtennis problem.

The paper is organised as follows. First, in Section 2, we introduce the de- 
cision multi-tree structure. Section 3 introduces the algorithm that allows a 
decision multi-tree to be constructed by employing bootstrap replicates. A thor- 
ough experimental evaluation is included in Section 4. Finally, the last section 
presents the conclusions and proposes some future work. 

2 Decision Multi-trees 

In this section we present the decision multi-tree structure. This structure can be 
seen as a generalisation of a classical decision tree. Basically, a decision multi-tree 
is a tree in which the rejected splits are not removed, but stored as suspended 
nodes. The further exploration of these nodes after the first solution is built 
permits the generation of new models. For this reason, we call this structure 
decision multi-tree, rather than a decision tree. Since each new model is ob- 
tained by continuing the construction of the multi-tree, these models share their 
common parts. 

Likewise, a decision multi-tree can also be seen as an AND/OR tree [12, 
14], if one considers the split nodes as being OR-nodes and considers the nodes 
generated by an exploited OR-node as being AND-nodes. 







Vicent Estruch et al. 



Formally, a decision multi-tree is formed by an AND-node at the root, with
a set of children which are OR-nodes (each one represents a split considered).
Each OR-node can be active or suspended. The active OR-nodes have a set of
children which are AND-nodes, each corresponding to a descendant of the split.
This schema is repeated for each new descendant node (AND-node), until an
AND-node is reached that is not explored further and is assigned a class (a leaf).




Figure 2 shows a decision multi-tree that contains the four decision trees of
Figure 1. AND-nodes are represented by unfilled circles with an arc
under the node. Leaves are represented by rectangles. OR-nodes are represented by
black-filled circles. In a decision multi-tree, as we can see, levels of AND-nodes
alternate with levels of OR-nodes. Note that a classic decision tree can be seen
as a decision multi-tree where only one OR-node is explored at each OR-node
level.
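The structure just described can be sketched as two mutually recursive node types. This is an illustrative sketch only: the class names, fields, and the helper `count_trees` are hypothetical and are not part of the SMILES implementation. The helper counts how many plain decision trees are encoded by choosing one active OR-node at every AND-node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OrNode:
    """One candidate split considered at an AND-node."""
    split: str                       # attribute tested (illustrative label)
    active: bool = False             # suspended until it is explored
    children: List["AndNode"] = field(default_factory=list)


@dataclass
class AndNode:
    """A node of the tree; a leaf when it carries a class label."""
    label: Optional[str] = None
    alternatives: List[OrNode] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.label is not None


def count_trees(node: AndNode) -> int:
    """Number of classic decision trees encoded below `node`: sum over the
    active splits, product over the branches of each split."""
    if node.is_leaf():
        return 1
    total = 0
    for alt in node.alternatives:
        if alt.active:
            product = 1
            for child in alt.children:
                product *= count_trees(child)
            total += product
    return total


def leaf(c: str) -> AndNode:
    return AndNode(label=c)


# Root with two explored splits and one suspended split.
inner = AndNode(alternatives=[
    OrNode("windy", True, [leaf("yes"), leaf("no")]),
    OrNode("temperature", True, [leaf("yes"), leaf("no")]),
])
root = AndNode(alternatives=[
    OrNode("outlook", True, [leaf("no"), leaf("yes"), leaf("yes")]),
    OrNode("humidity", True, [leaf("no"), inner]),
    OrNode("windy", False),          # suspended: contributes no tree
])
```

With exactly one active OR-node at every level, `count_trees` returns 1, which matches the remark that a classic decision tree is a special case of a multi-tree.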

In previous work [7], we have introduced an ensemble method that employs 
the decision multi-tree structure. In that work, the ensemble method performs a 
beam-search population of the decision multi-tree based on a random selection 
of the suspended nodes to be explored. However, one of the critical issues about 
this technique is the criterion for choosing the OR-node to “wake” from its 
suspended state. 

3 Bagging Construction of a Decision Multi-tree 

In this section we present our method to construct a decision multi-tree by using 
different training sets. As in Bagging, each training set is a bootstrap replicate 




Bagging Decision Multi-trees 






of the original training set. However, as we have mentioned in the introduction, 
our approach differs from Bagging in that we construct a single structure (a 
multi-tree) that includes the ensemble of decision trees obtained by Bagging but 
without repeating their common parts. 

In order to implement this, we use the first bootstrap replicate for filling the 
multi-tree with a single decision tree. Then, for each new bootstrap replicate 
we continue the construction of the multi-tree, but only exploiting those nodes 
which have not been considered in the previous bootstrap replicates (i.e. they 
are suspended OR-nodes). 

3.1 Algorithm 

The following algorithm formalises this process: 

Algorithm Bagging-Multi-tree (INPUT E: dataset, n: integer; OUTPUT
M: multi-tree) {n is the number of iterations}

    M = Initialize_multi_tree(); {M only contains an empty AND-node}

    for i = 1 to n do

        D = Bootstrap_replicate(E); {a bootstrap replicate is generated}
        if i = 1 then LearnM(M.root, D)

        else LearnMBagg(M.root, D)

    end for

end

As we can see, the algorithm begins by generating an empty multi-tree, that
is, a multi-tree that only contains one AND-node. Then, a bootstrap replicate
is obtained and processed in each step of the main loop. In the first iteration, a
multi-tree is constructed (procedure LearnM) and, in the following iterations,
this structure is populated using a new bootstrap replicate for selecting the
suspended OR-node that must be exploited (procedure LearnMBagg). In what
follows we describe both processes.

Procedure LearnM generates a multi-tree by selecting at each OR-level the
best OR-node to be exploited (using, e.g., GainRatio or another splitting criterion).
The process is repeated for the descendants of the active OR-node until a leaf is
reached.

Procedure LearnM (INPUT X: AND-node, D: training dataset; OUTPUT
M: multi-tree)

    if X = leaf then exit;

    List_of_OR_nodes = Create_OR_nodes(X, D); {generate a list with one
    OR-node for each possible split and their descendants (AND-nodes)}
    B = Select_Best_OR_node(List_of_OR_nodes, D); {the best node according to
    the split optimality criterion is selected}

    Activate(B); {the selected OR-node is activated}
    for Y ∈ children_of(B) do

        D' = filter(D, Y); {the examples of D that fall in node Y are selected}

        LearnM(Y, D')

    end for

end procedure
As in the above process, Procedure LearnMBagg also selects at each OR-level
the best OR-node to be exploited. In this case, however, the selected OR-node
can be active or suspended. If it is an active node (which means that it has
been exploited previously), the procedure continues by exploring its children.
If it is a suspended OR-node, it is activated and its children
are added to the multi-tree.

Procedure LearnMBagg (INPUT X: AND-node, D: training dataset; OUTPUT
M: multi-tree)

    if X = leaf then exit;

    List_of_OR_nodes = Update_OR_nodes(X, D); {update the split optimality of
    the OR-nodes according to the training set D}

    B = Select_Best_OR_node(List_of_OR_nodes, D);
    if B is active then

        for Y ∈ children_of(B) do
            D' = filter(D, Y);
            LearnMBagg(Y, D')
        end for

    else

        Activate(B);

        for Y ∈ children_of(B) do
            D' = filter(D, Y);

            LearnM(Y, D') {the multi-tree is expanded from node Y}

        end for

    end if
end procedure
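The construction loop above can be sketched in runnable form under several simplifying assumptions that are not taken from the paper: attributes are nominal only, plain information gain stands in for GainRatio, and a single function `learn` covers both LearnM and LearnMBagg (creating missing OR-nodes on the way down makes the two cases coincide). A node that was first reached with pure data, or with no attributes left, is kept as a leaf thereafter. All names here are hypothetical, and prediction and fusion are omitted.

```python
import math
import random
from collections import Counter


class AndNode:
    def __init__(self):
        self.label = None      # majority class seen on the first visit
        self.is_leaf = False
        self.or_nodes = {}     # attribute -> OrNode, one per possible split


class OrNode:
    def __init__(self, attr, values):
        self.attr = attr
        self.active = False    # suspended until selected as best split
        self.children = {v: AndNode() for v in values}


def entropy(data):
    n = len(data)
    counts = Counter(y for _, y in data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def gain(data, attr):
    parts = {}
    for x, y in data:
        parts.setdefault(x[attr], []).append((x, y))
    rest = sum(len(p) / len(data) * entropy(p) for p in parts.values())
    return entropy(data) - rest


def learn(node, data, attrs, domains):
    """Descend from `node`, activating the best split for this replicate."""
    if node.is_leaf or not data:
        return
    classes = Counter(y for _, y in data)
    if node.label is None:                      # first visit to this node
        node.label = classes.most_common(1)[0][0]
        node.is_leaf = len(classes) == 1 or not attrs
    if node.is_leaf or len(classes) == 1:
        return
    for a in attrs:                             # create missing OR-nodes
        if a not in node.or_nodes:
            node.or_nodes[a] = OrNode(a, domains[a])
    best = node.or_nodes[max(attrs, key=lambda a: gain(data, a))]
    best.active = True                          # no-op if already active
    rest = [a for a in attrs if a != best.attr]
    for value, child in best.children.items():
        subset = [(x, y) for x, y in data if x[best.attr] == value]
        learn(child, subset, rest, domains)


def bagging_multi_tree(E, n, seed=0):
    """E is a list of (attribute dict, class label) pairs."""
    rng = random.Random(seed)
    attrs = sorted(E[0][0])
    domains = {a: sorted({x[a] for x, _ in E}) for a in attrs}
    root = AndNode()
    for _ in range(n):
        D = [rng.choice(E) for _ in E]          # bootstrap replicate
        learn(root, D, attrs, domains)
    return root


def size(node):
    """Count AND-nodes and OR-nodes; descendants of suspended OR-nodes
    are not counted."""
    total = 1
    for o in node.or_nodes.values():
        total += 1
        if o.active:
            total += sum(size(c) for c in o.children.values())
    return total
```

In this sketch, a replicate that selects the same splits as an earlier one adds no nodes, so the structure grows only where a new replicate disagrees with what is already stored.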

As regards the combination of predictions, there is an important difference
with respect to classical ensemble methods: fusion points are distributed all over
the multi-tree structure. Concretely, we combine the votes at the AND-nodes
using the maximum fusion strategy. This strategy obtained the best results in
the experiments of [6].

3.2 Bagging Decision Trees versus Bagging Decision Multi-trees 

Although the algorithm presented in this section is inspired by the Bagging
method over decision trees, there are some differences between the two
techniques, and these can lead to differences in the errors made by each method.
The most significant difference is how the ensemble is used for predicting new
cases.

In Bagging decision trees, there is a significant probability (as we will see in
the experiments) of learning similar trees. In the prediction phase, the repeated
decision trees will carry more weight in the final decision. In the decision
multi-tree, since we avoid duplicated trees, all the leaves have identical weight
in the final decision. Note that this mechanism of ignoring repeated leaves for
the prediction can be intuitively justified by the fact that having a set of
semantically different models can help to improve the accuracy of the predictions.

Additionally, in a decision multi-tree the fusion of the predictions is
performed internally at the OR-nodes, while in an ensemble of decision trees the
voting is performed using each independent decision tree. Performing the fusion
in the internal nodes of the multi-tree can alter the final decision.
Furthermore, it represents an important improvement in the response time of
the ensemble.

4 Experiments 

In this section, we present an experimental evaluation of our approach as
implemented in the SMILES system [5]. SMILES is a multi-purpose machine
learning system which (among many other features) includes the implementation
of a multiple decision tree learner.



Table 1. Datasets used in the experiments.

 #   Dataset                         Size   Classes   Nom. Attr.   Num. Attr.
 1   Balance Scale                    325      3          0            4
 2   Breast Cancer                    699      2          0            9
 3   Breast Cancer Wisconsin          569      2          1           30
 4   Chess                           3196      2         36            0
 5   Dermatology                      366      6         33            1
 6   Hayes-Roth                       106      3          5            0
 7   Heart Disease                    920      5          8            5
 8   Hepatitis                        155      2         14            5
 9   Horse-colic-outcome              366      3         14            8
10   Horse-colic-surgical             366      2         14            8
11   House Congressional Voting       435      2         16            0
12   Iris Plant                       158      3          0            4
13   MONK's1                          566      2          6            0
14   MONK's2                          601      2          6            0
15   MONK's3                          554      2          6            0
16   New Thyroid                      215      3          0            5
17   Postoperative Patient             90      3          7            1
18   Segmentation Image Database     2310      7          0           14
19   Teaching Assistant Evaluation    151      3          2            3
20   Thyroid ANN                     7200      3         15            0
21   Tic-Tac-Toe Endgame              958      2          8            0
22   Wine Recognition                 178      3          0           13

For the experimental evaluation, we have employed 22 datasets from the UCI 
dataset repository [1]. Some details of the datasets are included in Table 1. 

For the experiments, we used GainRatio [15] as the splitting criterion. Pruning
is not enabled. The experiments were performed on a Pentium III 800 MHz with
180 MB of memory running Linux 2.4.2.

First, let us examine how the multi-tree grows with respect to the number
of iterations of Bagging. Table 2 shows the mean number of nodes (AND-nodes
and OR-nodes) in the multi-tree over all the datasets. The results are rather
surprising, since the number of nodes does not increase as one would expect: from
1 to 60 iterations the mean size grows less than fourfold. This
reflects the fact that Bagging tends to repeat the same decision trees frequently.
We must also remark that we do not remove the suspended OR-nodes. This is
the reason for the relatively high number of nodes in the multi-tree after just
one iteration.
Table 2. Mean size of multi-trees in number of nodes (OR-nodes and AND-nodes).

Number of iterations       1      20      40      60
Mean size               4021    9287   12927   15493

Table 3. Mean accuracy.

Iterations         1                20                40                60
 #         BagMDT   BagDT    BagMDT   BagDT    BagMDT   BagDT    BagMDT   BagDT
 1          77.94   79.40     79.26   81.92     80.13   82.85     80.63   82.92
 2          93.81   94.12     94.28   96.07     94.64   96.11     94.81   96.28
 3          92.43   94.22     93.48   95.61     93.13   95.91     93.29   95.91
 4          99.61   99.40     99.37   99.43     99.42   99.40     99.28   99.40
 5          91.58   93.80     94.00   96.43     94.31   96.89     94.75   97.05
 6          72.13   72.50     73.44   47.31     73.44   54.25     73.19   52.56
 7          51.64   47.26     54.58   46.21     54.72   46.45     55.98   46.52
 8          76.20   78.98     77.27   83.23     77.47   82.38     77.80   82.76
 9          62.72   65.82     64.14   65.66     64.81   65.77     65.33   65.79
10          78.53   83.15     81.92   82.86     82.19   83.12     82.50   83.06
11          94.74   95.47     94.95   96.45     94.77   96.55     94.81   96.58
12          94.13   94.60     95.13   94.57     94.13   94.33     94.47   94.20
13          96.73   95.11     95.93   99.84     96.42  100.00     95.73  100.00
14          70.97   62.66     67.62   66.38     67.45   66.91     66.17   67.19
15          97.47   98.67     98.00   98.74     97.69   98.90     97.93   98.88
16          92.81   92.95     92.76   94.24     93.29   94.56     93.10   94.76
17          66.00   59.89     66.25   63.23     65.88   64.44     65.50   65.00
18          95.91   96.90     95.29   97.67     95.23   97.69     95.23   97.62
19          61.93   56.46     56.93   59.35     57.33   61.50     55.60   60.51
20          99.23   99.72     99.15   99.65     99.06   99.65     98.98   99.65
21          77.16   79.16     79.54   83.17     80.32   83.82     80.34   83.80
22          93.00   93.49     93.12   95.26     93.06   95.79     92.94   95.90
Geomean     82.18   81.66     82.60   81.67     82.73   82.55     82.69   82.46
Average     83.49   83.35     83.93   83.79     84.04   84.42     84.02   84.38

Table 3 shows the accuracy comparison (10 x 10-fold cross-validation) between
classical Bagging as implemented in Weka (we call it BagDT) and the
proposed algorithm as implemented in Smiles (we call it BagMDT), depending
on the number of iterations of the ensemble method. We have employed
version 3.2.3 of Weka*. We use J48 as the base classifier (the Weka version of C4.5)
with the default settings, except for pruning, which is not enabled.

The results of both algorithms are similar. Initially, there is a slight
advantage for BagMDT due to some small differences between the two systems in
the implementation of the C4.5 algorithm. When the number of iterations is
increased, both learning methods improve in accuracy, and the difference between
them is partially reduced (BagDT even obtains better results if we consider the
arithmetic average). Nonetheless, it seems that BagMDT reaches the saturation
point earlier, probably because the repeated “good” branches are not given more
weight as the number of iterations grows.

* http://www.cs.waikato.ac.nz/~ml/weka/

Table 4. Mean training time.

Iterations         1                20                40                60
 #         BagMDT   BagDT    BagMDT   BagDT    BagMDT   BagDT    BagMDT   BagDT
 1           0.04    0.08      0.31    1.07      0.59    2.29      0.85    3.24
 2           0.08    0.06      0.67    1.07      1.29    2.32      1.87    3.19
 3           0.32    0.29      2.50    4.24      4.81    8.73      6.94   13.02
 4           0.26    0.53      2.28    6.39      4.43   12.44      6.51   18.72
 5           0.08    0.04      0.60    0.65      1.11    1.28      1.60    1.93
 6           0.01    0.01      0.01    0.18      0.03    0.35      0.04    0.57
 7           0.27    0.18      2.53    5.41      5.96   10.43     21.95   15.88
 8           0.04    0.03      0.26    0.36      0.48    0.75      0.72    1.13
 9           0.15    0.03      1.01    0.51      2.30    1.10      2.82    1.36
10           0.12    0.04      0.87    1.16      1.61    2.06      2.36    3.13
11           0.02    0.03      0.15    0.39      0.39    0.78      0.43    1.18
12           0.01    0.01      0.04    0.09      0.08    0.18      0.12    0.27
13           0.01    0.02      0.06    0.25      0.16    0.47      0.17    0.74
14           0.01    0.04      0.13    0.50      0.33    0.91      0.35    1.50
15           0.01    0.01      0.04    0.15      0.12    0.29      0.13    0.44
16           0.02    0.02      0.15    0.24      0.28    0.49      0.40    0.75
17           0.01    0.01      0.03    0.10      0.07    0.19      0.08    0.31
18           0.85    1.17      8.93   16.90     19.41   33.56     40.00   51.31
19           0.02    0.02      0.15    0.28      0.28    0.55      0.40    0.85
20           1.13    0.69      9.84    9.53     20.59   18.95     31.99   28.54
21           0.03    0.06      0.28    0.74      0.72    1.43      0.78    2.35
22           0.04    0.03      0.27    0.44      0.52    0.88      0.74    1.33
Geomean      0.04    0.05      0.34    0.73      0.72    1.44      1.03    2.19
Average      0.16    0.15      1.41    2.30      2.98    4.57      5.51    6.90

Fig. 3. Training time comparison (BagMDT vs. BagDT).

Table 4 contains the average learning time for each classifier and dataset,
and the geometric and arithmetic means over all the datasets. From a practical
point of view, it is in resource consumption where we see the advantages of using
decision multi-trees when learning ensembles of decision trees, since the training
time is significantly reduced. To better appreciate this, Figure 3 shows the
geometric average training time of Bagging, using Weka and Smiles, depending on
the size of the ensemble. While Bagging decision trees shows a linear increase in
time, Bagging decision multi-trees also shows a linear increase, but with a
significantly lower slope.
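The slope comparison can be checked directly from the geometric-mean row of Table 4. The sketch below fits an ordinary least-squares line to each series; the function name `least_squares_slope` is an assumption made for illustration, and the time values are in whatever units Table 4 reports.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Geometric-mean training times from Table 4.
iterations = [1, 20, 40, 60]
geomean_bagmdt = [0.04, 0.34, 0.72, 1.03]
geomean_bagdt = [0.05, 0.73, 1.44, 2.19]

slope_mdt = least_squares_slope(iterations, geomean_bagmdt)
slope_dt = least_squares_slope(iterations, geomean_bagdt)
ratio = slope_dt / slope_mdt
```

On these four points the fitted BagDT slope is a little more than twice the BagMDT slope, which is consistent with the qualitative reading of Figure 3.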



5 Conclusions 



In this work, we present an algorithm that reduces the high computational cost
characteristic of the Bagging method. The technique is based on the use of the
multi-tree structure. This structure allows trees to share their common parts.
Therefore, the more structurally similar the trees are, the greater the saving in
computational resources achieved by the multi-tree. Additionally, the multi-tree
also makes it possible to enhance a key parameter in ensemble techniques: diversity.
Note that by applying Bagging over C4.5, the set of decision trees obtained
presents many structural similarities (low diversity), so that misclassification
errors can easily be correlated. But if these decision trees are organised into
a multi-tree, the redundancy is reduced and, theoretically, the accuracy should
be better. In fact, Bagging multi-tree reaches the saturation point earlier than
classical Bagging. However, a collateral effect arises when the multi-tree is used,
because this structure does not take into account how many times a branch
appears. Therefore, frequent branches and unusual ones have the same
weight in the classification stage. The experimental evaluation makes clear that
it would be feasible to obtain a trade-off between redundancy and diversity
by using the multi-tree structure.

Summing up, the multi-tree structure can be viewed as a feasible and elegant
way to overcome the main inherent drawbacks of Bagging (a huge consumption of
computational resources, and redundancy).

As future work, it would be interesting to investigate how we can improve
accuracy by an adequate adjustment of the diversity and redundancy parameters
in Bagging multi-trees. Additionally, we also plan to study whether the multi-tree
is able to enhance other well-known ensemble methods, such as Boosting.
Abstract. An ensemble of classifiers based algorithm, Learn++, was recently
introduced that is capable of incrementally learning new information from
datasets that consecutively become available, even if the new data introduce
additional classes that were not formerly seen. The algorithm does not require
access to previously used datasets, yet it is capable of largely retaining the
previously acquired knowledge. However, Learn++ suffers from the inherent
“out-voting” problem when asked to learn new classes, which causes it to
generate an unnecessarily large number of classifiers. This paper proposes a
modified version of this algorithm, called Learn++.MT, that not only reduces the
number of classifiers generated, but also provides performance improvements.
The out-voting problem, the new algorithm and its promising results on two
benchmark datasets as well as on one real world application are presented.



1 Introduction 

It is well known that the amount of training data available and how well the data rep- 
resent the underlying distribution are of paramount importance for an automated clas- 
sifier’s satisfactory performance. For many applications of practical interest, obtain- 
ing such adequate and representative data is often expensive, tedious, and time 
consuming. Consequently, it is not uncommon for the entire data to be obtained in 
installments, over a period of time. Such scenarios require a classifier to be trained 
and incrementally updated - as new data become available - where the classifier 
needs to learn the novel information provided by the new data without forgetting the 
knowledge previously acquired from the data seen earlier. This raises the so-called 
stability-plasticity dilemma [1]: a completely stable classifier can retain knowledge, 
but cannot learn new information, whereas a completely plastic classifier can instantly 
learn new information, but cannot retain previous knowledge. Many popular classifi- 
ers, such as the ubiquitous multilayer perceptron (MLP) or the radial basis function 
networks, are not structurally suitable for incremental learning, since they are “com- 
pletely stable” classifiers. The approach generally followed for learning from new 
data involves discarding the existing classifier, combining the old and the new data 
and training a new classifier from scratch using the aggregate data. This causes the 
previously learned information to be lost, a phenomenon known as catastrophic
forgetting [2]. Furthermore, training with the combined data may not even be feasible, if
the previously used data are lost, corrupted, prohibitively large, or otherwise unavail- 
able. 
We have recently introduced an algorithm, called Learn++, capable of learning
incrementally, even under hostile learning conditions: not only does Learn++ assume
the previous data to be no longer available, but it also allows additional classes to be
introduced with new data, while retaining the previously acquired knowledge.

Learn++ is an ensemble approach, inspired primarily by the AdaBoost algorithm.
Similar to AdaBoost, Learn++ also creates an ensemble of (weak) classifiers, each
trained on a subset of the current training dataset, and later combined through
weighted majority voting. Training instances for each classifier are drawn from an
iteratively updated distribution. The main difference is that the distribution update
rule in AdaBoost is based on the performance of the previous hypothesis [3], which
focuses the algorithm on difficult instances, whereas that of Learn++ is based on the
performance of the entire ensemble [4], which focuses this algorithm on instances that
carry novel information. This distinction gives Learn++ the ability to learn new data,
even when previously unseen classes are introduced. As new data arrive, Learn++
generates additional classifiers, until the ensemble learns the novel information. Since
no classifier is discarded, previously acquired knowledge is retained. Other
approaches suggested for incremental learning, as well as a bibliography of ensemble
systems and their applications, can be found in and within the references of [4-9].

As reported in [4,5], Learn++ works rather well on a variety of real world problems,
though there is much room for improvement. An issue of concern is the relatively
large number of classifiers required for learning instances coming from a new
class. This is because, when a new dataset introduces a previously unseen class, new
classifiers are trained to learn the new class; however, the existing classifiers continue
to misclassify instances from the new class. Therefore, the decisions of the later
classifiers that recognize the new class are out-voted by the previous classifiers that do not
recognize the new class, until a sufficient number of new classifiers are generated that
recognize the new class. This leads to classifier proliferation.
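
The out-voting effect described above can be illustrated with a minimal sketch. The function names and the equal voting weights are assumptions made for illustration only; they are not taken from the Learn++ implementation.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Return the class label that collects the largest total voting weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # On a tie, max() keeps the label that was inserted first.
    return max(totals, key=totals.get)

def new_classifiers_needed(n_old, weight=1.0):
    """Hypothetical scenario: n_old classifiers never saw class 3 and vote
    for class 1, while every new classifier votes for class 3. Count how many
    new classifiers are needed before class 3 wins the vote."""
    n_new = 0
    while True:
        n_new += 1
        predictions = [1] * n_old + [3] * n_new
        weights = [weight] * len(predictions)
        if weighted_majority_vote(predictions, weights) == 3:
            return n_new

needed = new_classifiers_needed(6)
```

With six existing classifiers of equal weight, seven new classifiers are needed in this toy setting before the new class can win, which is the proliferation the text refers to.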

In this contribution, we first describe the out-voting problem associated with the
original Learn++, propose a modified version of the algorithm to address this issue,
and present some preliminary simulation results on three benchmark datasets.



2 Learn++.MT

In ensemble approaches that use a voting mechanism for combining classifier outputs,
each classifier votes on the class it predicts [10, 11]. The final classification is then
determined as the class that receives the highest total vote from all classifiers.
Learn++ uses weighted majority voting [12], where each classifier receives a voting
weight based on its training performance. This works well in practice for most
applications. However, for incremental learning problems that involve the introduction of new
classes, the voting scheme proves to be unfair towards the newly introduced class:
since none of the previously generated classifiers can pick the new class, a relatively
large number of new classifiers that recognize the new class are needed, so that their
total weight can out-vote the first batch of classifiers on instances of the new class.
This in turn populates the ensemble with an unnecessarily large number of classifiers.
Learn++.MT is specifically designed to address the classifier proliferation issue.
The novelty in Learn++.MT is the way in which the voting weights are determined.
Learn++.MT also uses a set of voting weights based on the classifiers' performances;
however, these weights are then adjusted based on the classification of the specific
instance at the time of testing, through dynamic weight voting (DWV).

For any given test instance, Learn++.MT compares the class predictions of each 
classifier and cross-references them against the classes on which they were trained. If 
a subsequent ensemble overwhelmingly chooses a class it has seen before, then the 
voting weights of those classifiers not trained with that class are proportionally re- 
duced. As an example, assume that an ensemble has seen classes 1 and 2, and a sec- 
ond ensemble has seen classes 1, 2 and 3. For a given instance, if the second ensemble 
(trained on class 3) picks class 3, the classifiers in the first ensemble (which has not 
seen class 3) reduce their voting weights in proportion to the confidence of the second 
ensemble. In other words, when the algorithm detects that the new classifiers
overwhelmingly choose a new class on which they were trained, the weights of the other
classifiers which have not seen this new class are reduced. The Learn++.MT algorithm
is given in Figures 1 and 2, and explained in detail below.

For each dataset $\mathcal{D}_k$ that becomes available to Learn++.MT, the inputs to the
algorithm are (i) a sequence of $m_k$ training data instances $x_i$ and their correct labels $y_i$,
(ii) a classification algorithm BaseClassifier, and (iii) an integer $T_k$ specifying the
maximum number of classifiers to be generated using that database. If the algorithm
is seeing its first database ($k=1$), a data distribution ($D_t$), from which training
instances will be drawn, is initialized to be uniform, making the probability of any
instance being selected equal. If $k>1$, then the distribution is updated from the
previous step based on the performance of the existing ensemble on the new data. The
algorithm then adds $T_k$ classifiers to the ensemble starting at $t = eT_k + 1$, where $eT_k$
denotes the number of classifiers that currently exist in the ensemble.

For each iteration $t$, the instance weights, $w_t$, from the previous iteration are first
normalized (step 1) to create a weight distribution $D_t$. A hypothesis, $h_t$, is generated
from a subset of $\mathcal{D}_k$ that is drawn from $D_t$ (step 2). The error, $\varepsilon_t$, of $h_t$ is then
calculated; if $\varepsilon_t > 1/2$, the algorithm deems the current classifier, $h_t$, to be too weak, discards
it, and returns to step 2; otherwise, it calculates the normalized error $\beta_t$ (step 3). The
class labels of the training instances used to generate this hypothesis are then stored as
$CTr_t$ (step 4). The dynamic weight voting (DWV) algorithm is called to obtain the
composite hypothesis, $H_t$, of the ensemble (step 5). $H_t$ represents the ensemble
decision of the first $t$ hypotheses generated thus far. The error of the composite
hypothesis, $E_t$, is then computed and normalized (step 6). The instance weights $w_t$ are finally
updated according to the performance of $H_t$ (step 7) such that the weights of instances
correctly classified by $H_t$ are reduced (and those that are misclassified are effectively
increased). This ensures that the ensemble focuses on those regions of the feature space
that are not yet learned, paving the way for incremental learning.
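
The instance weight update at the end of each iteration can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the normalized composite error is assumed to take the form E / (1 - E), mirroring the normalized error of a single hypothesis.

```python
def normalize(weights):
    """Turn non-negative instance weights into a probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]

def weighted_error(dist, predicted, labels):
    """Sum of the distribution mass on misclassified instances."""
    return sum(d for d, p, y in zip(dist, predicted, labels) if p != y)

def update_instance_weights(weights, ensemble_pred, labels):
    """Reduce the weights of instances the ensemble classifies correctly.

    Assumes 0 < E < 1/2, so that the factor B = E / (1 - E) lies in (0, 1).
    """
    dist = normalize(weights)
    E = weighted_error(dist, ensemble_pred, labels)
    B = E / (1.0 - E)
    new_weights = [w * B if p == y else w
                   for w, p, y in zip(weights, ensemble_pred, labels)]
    return new_weights, E, B

# Toy example: four instances, the ensemble misclassifies only the third one.
labels = [1, 2, 3, 3]
ensemble_pred = [1, 2, 1, 3]
new_w, E, B = update_instance_weights([1.0, 1.0, 1.0, 1.0], ensemble_pred, labels)
new_dist = normalize(new_w)
```

After the update, the single misclassified instance carries half of the probability mass in this toy case, so the next classifier is likely to be trained on it.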

The inputs to the dynamic weight voting algorithm are (i) the current training data
(during training) or any test instance, (ii) classifiers $h_t$, (iii) $\beta_t$, the normalized error for
each $h_t$, and (iv) the vector $CTr_t$ containing the classes on which $h_t$ has been trained.
Classifier weights are first initialized (step 1), where each classifier receives a
standard weight that is inversely proportional to its normalized error $\beta_t$, so that those
classifiers that performed well on their training data are given higher voting weights. A
normalization factor $Z_c$ is then created for each class $c$ as the sum of the weights of all
classifiers trained on instances from class $c$ (step 2).
Input: For each dataset $\mathcal{D}_k$, $k = 1, 2, \ldots, K$

• Sequence of $i = 1, \ldots, m_k$ instances $x_i$ with labels $y_i \in Y_k = \{1, \ldots, c\}$.

• Weak learning algorithm BaseClassifier.

• Integer $T_k$, specifying the number of iterations.

Do for $k = 1, 2, \ldots, K$:

If $k = 1$, initialize $w_1(i) = D_1(i) = 1/m_1$ for all $i$, and $eT_1 = 0$.

Else, go to Step 5 to evaluate the current ensemble on the new dataset $\mathcal{D}_k$,
update weights, and recall the current number of classifiers $eT_k = \sum_{j=1}^{k-1} T_j$.

Do for $t = eT_k + 1, eT_k + 2, \ldots, eT_k + T_k$:

1. Set $D_t = w_t / \sum_{i=1}^{m_k} w_t(i)$ so that $D_t$ is a distribution.

2. Call BaseClassifier with a subset of $\mathcal{D}_k$ randomly chosen using $D_t$.

3. Obtain $h_t : X \to Y$, and calculate its error: $\varepsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)$.
If $\varepsilon_t > 1/2$, discard $h_t$ and go to step 2. Otherwise, compute the
normalized error as $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.

4. $CTr_t = Y_k$, to save labels of classes used in training $h_t$.

5. Call DWV to obtain the composite hypothesis $H_t$.

6. Compute the error of the composite hypothesis: $E_t = \sum_{i:\, H_t(x_i) \ne y_i} D_t(i)$.

7. Set $B_t = E_t / (1 - E_t)$ and update the instance weights:
$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}$

Call DWV to obtain the final hypothesis, $H_{final}$.

Fig. 1. Learn++.MT Algorithm.



For each instance, a preliminary per-class confidence factor $0 \le P_c \le 1$ is generated
(step 3). $P_c$ is the sum of weights of all the classifiers that choose class $c$ divided by
the sum of the weights of all classifiers trained with class $c$ (which is $Z_c$). In effect,
this can be considered as the ensemble-assigned confidence of the instance belonging
to each of the classes. Then, again for each class, the weights are adjusted for
classifiers that have not been trained with that class; that is, the weights are lowered
in proportion to the ensemble's preliminary confidence in that class (step 4). The
final / composite hypothesis is then calculated as the class that receives the maximum
sum of the weights of the classifiers that chose it (step 5).
Input:

• Sequence of $i = 1, \ldots, n$ training instances or any test instance $x_i$.

• Classifiers $h_t$.

• Corresponding error values, $\beta_t$.

• Classes, $CTr_t$, used in training $h_t$.

For $t = 1, 2, \ldots, T$, where $T$ is the total number of classifiers:

1. Initialize classifier weights $W_t = \log(1/\beta_t)$.

2. Create normalization factor, $Z$, for each class:
$Z_c = \sum_{t:\, c \in CTr_t} W_t$, for $c = 1, 2, \ldots, C$ classes.

3. Obtain preliminary decision $P_c = \sum_{t:\, h_t(x_i) = c} W_t \,/\, Z_c$.

4. Update voting weights $W_{t:\, c \notin CTr_t} = W_{t:\, c \notin CTr_t} (1 - P_c)$, for $c = 1, 2, \ldots, C$.

5. Compute final / composite hypothesis
$H_{final}(x_i) = \arg\max_c \sum_{t:\, h_t(x_i) = c} W_t$.

Fig. 2. Dynamic Weight Voting Algorithm for Learn++.MT.
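
The five steps of dynamic weight voting can be sketched for a single instance as below. This is a minimal illustration under stated assumptions: the function signature, the use of the natural logarithm, and the toy ensemble are hypothetical choices, not the authors' implementation.

```python
import math

def dynamic_weight_voting(predictions, betas, trained_classes, classes):
    """Sketch of DWV for one instance.

    predictions[t]     : class chosen by classifier t
    betas[t]           : normalized error of classifier t (0 < beta < 1)
    trained_classes[t] : set of classes classifier t was trained on
    classes            : list of all class labels seen so far
    """
    # Step 1: initial voting weights from the normalized errors.
    W = [math.log(1.0 / b) for b in betas]
    # Step 2: per-class normalization factor over classifiers trained on c.
    Z = {c: sum(w for w, seen in zip(W, trained_classes) if c in seen)
         for c in classes}
    # Step 3: preliminary per-class confidence (0 if nobody was trained on c).
    P = {c: (sum(w for w, p in zip(W, predictions) if p == c) / Z[c])
         if Z[c] > 0 else 0.0
         for c in classes}
    # Step 4: lower the weights of classifiers not trained on class c.
    for c in classes:
        W = [w if c in seen else w * (1.0 - P[c])
             for w, seen in zip(W, trained_classes)]
    # Step 5: class with the largest sum of adjusted weights.
    totals = {c: sum(w for w, p in zip(W, predictions) if p == c)
              for c in classes}
    return max(classes, key=lambda c: totals[c]), P

# Toy ensemble: six classifiers trained on classes {1, 2} vote for class 1,
# two newer classifiers trained on {1, 2, 3} vote for class 3.
predictions = [1] * 6 + [3] * 2
betas = [0.1] * 8
trained = [{1, 2}] * 6 + [{1, 2, 3}] * 2
winner, confidence = dynamic_weight_voting(predictions, betas, trained, [1, 2, 3])
```

In this toy case the two classifiers that know class 3 agree on it, so the confidence for class 3 is 1 and the six older classifiers lose their voting weight; a plain weighted majority vote with the same initial weights would have chosen class 1.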



3 Learn++.MT Simulation Results 

Learn++.MT has been tested on several databases. For brevity, we present results on 
two benchmark databases and one real-world application. The benchmark databases 
are the Wine database and the Optical Character Recognition database from UCI [13], 
and the real world application is a gas identification problem for determining one of 
five volatile organic compounds based on chemical sensor data. MLPs, normally
incapable of incremental learning, were used as base classifiers in all three cases.
Base classifiers were all single layer MLPs with 20-50 nodes and a rather generous
error goal of 0.1 - 0.01 to ensure weak classifiers with respect to the difficulty of the
underlying problem.



3.1 Wine Recognition Database 

The Wine Recognition database features 3 classes with 13 attributes. The database
was split into two training datasets, a validation dataset, and a test dataset. The data distribution is
given in Table 1. In order to test the algorithms' ability to incrementally learn a new
class, instances from class 3 are only included in the second dataset. Each algorithm
was allowed to create a set number of classifiers (30) on each dataset. The optimal
number of classifiers to retain for each dataset was automatically determined based on
the maximum performance on the validation data. This process was applied 30 times
on Learn++ and Learn++.MT to compare their generalization performance on the test
data, the mean results of which are shown in Tables 2 and 3. Each row shows the
class-by-class generalization performance of the ensemble on the test data after being
trained with dataset $\mathcal{D}_k$, $k = 1, 2$. The last two columns are the average overall
generalization performance over 30 simulation trials (on the entire test data, which includes
instances from all three classes), and the standard deviation of the generalization
performances. The number of classifiers in the ensemble after each training session is
given in parentheses.



Table 1. Wine Recognition database distribution.

Class →             1     2     3
$\mathcal{D}_1$    26    31     0
$\mathcal{D}_2$    13    16    32
Valid.              7     8     5
Test               13    16    11



Table 2. Learn++ performance results on Wine Recognition database.

Class →                  1      2      3     Gen.   Std.
$\mathcal{D}_1$ (6)     99%    96%    -      71%    5.8
$\mathcal{D}_2$ (26)   100%    94%    24%    77%    12.1



Table 3. Learn++.MT performance results on Wine Recognition database.

Class →                 1      2      3     Gen.   Std.
$\mathcal{D}_1$ (5)    96%    95%    -      70%    6.0
$\mathcal{D}_2$ (6)    99%    87%    90%    92%    5.0



Tables 2 and 3 show that Learn++.MT not only incrementally learns the new class,
but also outperforms its predecessor by 15% while using significantly fewer
classifiers. The poor performance of Learn++ on the new class (class 3) is explained
below within the context of larger database simulations.
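
As a rough consistency check, the overall generalization figures can be approximated from the class-wise figures and the test set sizes. The sketch below assumes the reported values (test counts of 13, 16 and 11 instances, and the class-wise results after the second training session); since those values are means rounded over 30 trials, the recomputed numbers are only approximate.

```python
def overall_accuracy(per_class_acc, test_counts):
    """Average of class-wise accuracies weighted by the test instances per class."""
    correct = sum(a * n for a, n in zip(per_class_acc, test_counts))
    return correct / sum(test_counts)

test_counts = [13, 16, 11]  # test instances of classes 1, 2, 3

# Class-wise results after training on the second dataset.
learnpp = overall_accuracy([1.00, 0.94, 0.24], test_counts)
learnpp_mt = overall_accuracy([0.99, 0.87, 0.90], test_counts)
gain_points = round(learnpp_mt * 100) - round(learnpp * 100)
```

The recomputed values round to 77% and 92%, in line with the reported 15 percentage point difference.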

3.2 Optical Character Recognition Database 

The optical character recognition (OCR) database features 10 classes (digits 0-9)
with 64 attributes. The database was split into four to create three training subsets and a test
subset, whose distribution can be seen in Table 4. In this case, we wanted to evaluate
the performance of each algorithm with a fixed number of classifiers (rather than
determining the number of classifiers via a validation set) so that they can be compared
on an equal number of classifiers. Each algorithm was allowed to create five classifiers
with the addition of each dataset (a total of 15 classifiers in three training sessions).
The data distribution was deliberately made rather challenging, specifically designed
to test the algorithms' ability to learn multiple new classes at once with each
additional dataset while retaining the knowledge of previously learned classes. In this
incremental learning problem, instances from only six of the ten classes are present in
each subsequent dataset, resulting in a rather difficult problem. Results previously
obtained using Learn++ on this data with less challenging data distributions were in
the lower to mid 90% range [4,5]. Results from this test are shown in Tables
5 and 6, which are formatted similarly to the previous tables.



Table 4. OCR data distribution.

Class →             0     1     2     3     4     5     6     7     8     9
$\mathcal{D}_1$   250   250   250     0     0   250   250   250     0     0
$\mathcal{D}_2$   150     0   150   250     0   150     0   150   250     0
$\mathcal{D}_3$     0   150     0   150   400     0   150     0   150   400
Test              110   114   111   114   113   111   111   113   110   112



Table 5. Learn++ performance results on OCR database.

Class →             0     1     2     3     4     5     6     7      8     9    Gen.   Std.
$\mathcal{D}_1$    99%   98%   99%   -     -     96%   99%   100%   -     -     59%    0.6%
$\mathcal{D}_2$    98%   98%   99%   32%   -     96%   99%   100%   60%   -     68%    18%
$\mathcal{D}_3$    98%   96%   99%   94%   22%   96%   99%   100%   90%   13%   81%    4.0%



Table 6. Learn++.MT performance results on OCR database.

Class →             0     1     2     3     4     5     6     7      8     9    Gen.   Std.
$\mathcal{D}_1$    95%   98%   98%   -     -     95%   99%   100%   -     -     58%    0.8%
$\mathcal{D}_2$    96%   95%   99%   95%   -     95%   98%   100%   98%   -     69%    0.6%
$\mathcal{D}_3$    67%   95%   92%   98%   83%   63%   98%   100%   95%   96%   89%    0.7%



Interesting observations can be made from these tables. First, we note that Learn++
was able to learn the new classes, 3 and 8, only poorly after they were first introduced
in $\mathcal{D}_2$, but was able to learn them rather well when further trained with these classes in
$\mathcal{D}_3$. Similarly, it performs rather poorly on classes 4 and 9 after they are first
introduced in $\mathcal{D}_3$, though it is reasonable to expect that it would do well on these classes
with additional training. More importantly, however, Learn++.MT was able to learn
new classes quite well in its first attempt. Finally, recall that the generalization
performance of the algorithm is computed on the entire test data, which includes
instances from all classes. This is why the generalization performance is only around
60% after the first training session, since the algorithms have seen only six of the 10
classes in the test data. Both Learn++ and Learn++.MT exhibit an overall increase in
generalization performance as new datasets are introduced, and hence the ability to
learn incrementally. Learn++.MT, however, is able to learn not only faster, but also better
than Learn++, as demonstrated by the significant jump in final generalization performance
(81% to 89%).
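
The figure of roughly 60% after the first training session follows directly from the test set composition. The sketch below computes the best overall accuracy an ensemble could reach if it classified every instance of the classes it has seen correctly and none of the others; the helper name is hypothetical, and the per-class test counts are those reported for the OCR test subset.

```python
# Test instances per digit class, as reported for the OCR test subset.
test_counts = {0: 110, 1: 114, 2: 111, 3: 114, 4: 113,
               5: 111, 6: 111, 7: 113, 8: 110, 9: 112}

def accuracy_ceiling(seen_classes, counts):
    """Upper bound on overall accuracy when only seen classes can be recognized."""
    return sum(counts[c] for c in seen_classes) / sum(counts.values())

ceiling_after_d1 = accuracy_ceiling([0, 1, 2, 5, 6, 7], counts=test_counts)
ceiling_after_d2 = accuracy_ceiling([0, 1, 2, 3, 5, 6, 7, 8], counts=test_counts)
```

The ceiling is just under 60% after the first dataset and just under 80% after the second, which puts the reported overall figures for both algorithms in context.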

3.3 Volatile Organic Compound Recognition Database 

The Volatile Organic Compound (VOC) database is a real world dataset that consists
of 5 classes (toluene, xylene, hectane, octane and ketone) with 6 attributes coming
from six (quartz crystal microbalance type) chemical gas sensors. The dataset was
divided into three training datasets and a test dataset. The distribution of the data is given in
Table 7, where a new class was introduced with each dataset.



Table 7. Volatile Organic Compounds database.

Class →             1     2     3     4     5
$\mathcal{D}_1$    20     0    20     0    40
$\mathcal{D}_2$    10    25    10     0     W
$\mathcal{D}_3$    10    15    10    40    10
Test               24    24    24    40    52



Again, both algorithms were incrementally trained with three subsequent training
datasets. In this experiment, both algorithms were allowed to generate as many
classifiers as necessary to obtain their maximum performance. Learn++ generated a total
of 36 classifiers to achieve its best performance. Learn++.MT, however, not only
generated only 16 classifiers, but also provided a significant improvement in
generalization performance.



Table 8. Learn++ performance results on VOC database.

Class →                 1      2      3      4      5     Gen.   Std.
$\mathcal{D}_1$ (6)    90%    -      91%    -      88%    55%    1.3%
$\mathcal{D}_2$ (12)   89%    61%    95%    -      87%    64%    4.3%
$\mathcal{D}_3$ (18)   82%    95%    94%    12%    83%    69%    6.6%



Table 9. Learn++.MT performance results on VOC database.

Class →                1      2      3      4      5     Gen.   Std.
$\mathcal{D}_1$ (6)   90%    -      89%    -      89%    54%    2.0%
$\mathcal{D}_2$ (4)   86%    85%    93%    -      75%    63%    3.5%
$\mathcal{D}_3$ (6)   81%    96%    91%    83%    67%    81%    2.3%



The results averaged over 20 trials are given in Tables 8 and 9. We note that both
algorithms exhibit performance characteristics similar to those obtained with
the previous databases. Specifically, Tables 8 and 9 show a significant increase from
Learn++ to Learn++.MT in the average generalization performance (69% to 81%). Furthermore,
Learn++.MT was able to achieve its performance using 20 fewer classifiers, while
learning each new class faster than its predecessor.



4 Conclusions and Discussions 

In this paper we presented Learn++.MT, a modified version of our previously
introduced incremental learning algorithm, Learn++. The novelty of the new algorithm is
its use of preliminary confidence factors in assigning voting weights, based on a
cross-reference of the classes that have been seen by each classifier during training.
Specifically, if a majority of the classifiers that have seen a class vote for that class,
the voting weights of those classifiers that have not seen that class are reduced in
proportion to the preliminary confidence. This allows the algorithm to dynamically
adjust the voting weights for each test instance. The approach overcomes the
out-voting problem inherent in the original version of Learn++ and prevents the proliferation
of unnecessary classifiers. The new algorithm also provided substantial improvements
in generalization performance on all datasets we have tried so far. We note that
these improvements are more significant in those cases where one or several new
classes are introduced with subsequent datasets.

It is also worth noting that Learn++.MT is more robust than its predecessor. One
of the reasons why Learn++ has difficulty in learning a new class when it is first
presented is the difficulty in choosing the strength of the base classifiers. If we
choose classifiers that are too weak, the algorithm is unable to learn. If we choose classifiers
that are too strong, the training data are learned very well, resulting in very low $\beta$ values,
which then cause very high voting weights, and hence an even more difficult
out-voting problem. This explains why Learn++ requires a larger number of classifiers or
repeated training to learn the new classes. Learn++.MT, by significantly reducing the
effect of the out-voting problem, improves the robustness of the algorithm, as the new
algorithm is substantially more resistant to drastic variations in the classifier
architecture and parameters (error goal, number of hidden layer nodes, etc.).

We should also note, however, that while we have used MLPs as base classifiers, both
algorithms are in fact independent of the type of the base classifier used, and can learn
incrementally with any supervised classifier that lacks this ability. In fact, the
classifier independence of Learn++ was demonstrated and reported in [5].

Further optimization of the distribution update rule, the selection of voting weights, 
as well as validation of the techniques on a broader spectrum of applications are cur- 
rently underway. 
Abstract. We present extensive experimental work evaluating the behaviour of
Recursive ECOC (RECOC) [1] learning machines based on Low Density Parity
Check (LDPC) coding structures. We show that owing to the iterative decoding
algorithms behind LDPC codes, RECOC multiclass learning is progressively
achieved. This learning behaviour confirms the existence of a new boosting
dimension, the one provided by the coding space. We present a method for
searching potential good RECOC codes from LDPC ones. Starting from a prop- 
erly selected LDPC code, we assess the effect of boosting in both weak and 
strong binary learners. For nearly all domains, we find that boosting a strong 
learner like a Decision Tree is as effective as boosting a weak one like a Deci- 
sion Stump. This surprising result substantiates the hypothesis that weakening 
strong classifiers by boosting has a decorrelation effect, which can be used to 
improve RECOC learning. 



1 Introduction 

Standard ECOC [2] codes are the result of exhaustive searches on sets of random
codes. No particular combinatorial constraint is assumed for such sets. Both dense
and sparse random ECOC codes have been explored. To some extent, this brute force
approach is justified considering that the design of optimal ECOC codes is NP-hard
[3]. ECOC codes of this type can only be defined by the enumeration of valid
codewords. The enumeration yields their matrix representation. It should be noted,
however, that ECOC matrices do not convey a true coding structure. As a result,
the means of assembling ECOC hypotheses, which are tightly linked to decoding algorithms, are
strongly limited. In fact, they are constrained to loss variants [4] of the trivial
Minimum Hamming distance decoding algorithm.

In previous work [1][5], we explored the ECOC multiclass learning limitations arising
from this customary approach and proposed an alternative strategy for overcoming
them. The main observed limitations are the exhaustive search requirement, and the
disagreement between experimental results and the behaviour predicted by coding
theory. Let us consider the experimental results shown in [4]. They do not provide a
guide for the choice of an ECOC strategy. Ideally, we would like to bound the search
space with simple but effective constraints so that learning results are at least
coherent. In view of the empirical evidence that random ECOC matrices induce almost
uncorrelated learners and that randomness is at the core of the families of recursive
error correcting codes [6], we proposed the design of ECOC codes using output
coding transformations defined by recursive error correcting codes. These codes are
constructed from component subcodes with some imprinting of randomness. Despite
their random flavour, a proper coding structure is still preserved. This structure is
present in each component subcode and is cleverly used by bitwise, soft iterative
decoding algorithms [7]. In ECOC learning terms, it implies that overall learning is
spread between binary learners working in multiple local multiclass learning
contexts, as many as there are component subcodes.

The remainder of this paper is organized as follows. In section 2, we briefly review
RECOC learning models based on LDPC codes [8][9]. In section 3, we analyze the
setting of parameters required by RECOC LDPC learning. In section 4, we present
experimental results. Finally, in section 5, we present conclusions and further work.

2 Recursive ECOC Learning 

The effectiveness of general error correcting codes strongly depends on the
memoryless channel assumption, i.e. errors must be independent. ECOC matrices arising from
classic error correcting codes induce correlated binary learners, i.e. channels with
memory, so that training error degrades [10][11]. As one might anticipate, the
correlation effect might be diluted if we allow a divide and conquer approach, i.e. ECOC
learning machines constructed from a set of component ones. On each component
machine only a subset of binary learners would work. Subsets could be constructed
randomly but keeping their size small so that the probability of detrimental
cooperation between correlated binary learners remains small. These ideas are at the core of
RECOC LDPC learning machines. LDPC codes are linear codes belonging to the
family of recursive error correcting codes. They are defined by binary random sparse
parity check matrices. For sufficiently large codeword block lengths $n$, they perform near
the Shannon limit when decoded with the iterative Sum-Product (SP) [9] algorithm
over a Gaussian channel.

Definition (LDPC codes [8]): A Low-Density Parity Check (LDPC) code is
specified by a pseudorandom parity check matrix $H$ containing mostly 0's and a small
number of ones. A binary $(n, j, v)$¹ LDPC code has block length $n$ and a
parity-check matrix $H$ with exactly $j$ ones per column and $v$ ones in each row, assuming
$j \ge 3$ and $v \ge j$. Thus, every code bit is checked by exactly $j$ parity checks and every
parity check involves $v$ codeword bits. The typical minimum distance of these codes
increases linearly with $n$ for a fixed $j$ and $v$.
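
The regularity conditions in the definition can be made concrete with a small sketch. The construction below stacks $j$ bands of rows, each a random column permutation of a simple base band; this Gallager-style construction is an assumption chosen for illustration (the definition only asks for a pseudorandom matrix), and the function names are hypothetical.

```python
import random

def gallager_parity_check(n, j, v, seed=0):
    """Build a regular (n, j, v) parity check matrix as a list of 0/1 rows.

    Each band has n // v rows with v ones per row and one 1 per column;
    stacking j randomly permuted bands gives j ones per column.
    """
    assert n % v == 0, "n must be a multiple of v for this construction"
    rng = random.Random(seed)
    rows_per_band = n // v
    base = [[1 if col // v == r else 0 for col in range(n)]
            for r in range(rows_per_band)]
    H = [row[:] for row in base]
    for _ in range(j - 1):
        perm = list(range(n))
        rng.shuffle(perm)
        H.extend([[row[perm[col]] for col in range(n)] for row in base])
    return H

def is_regular(H, j, v):
    """Check that every column has j ones and every row has v ones."""
    cols_ok = all(sum(row[c] for row in H) == j for c in range(len(H[0])))
    rows_ok = all(sum(row) == v for row in H)
    return cols_ok and rows_ok

H = gallager_parity_check(n=12, j=3, v=4)
```

Counting ones by columns and by rows gives the identity $n \cdot j = (\text{number of rows}) \cdot v$, so the 12-column example has 9 parity checks.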

Let $c(x) \in C : X \to Y$, $|Y| = M \ge 2$, be a target multiclass concept explained by a
training sample $S$. Let us assume a binary linear code $\Theta$ with parameters $(n, k, d)$,
i.e. codewords of length $n$ conveying $k$ information bits with Minimum Hamming
distance $d$ between them. Thus, we say that the channel rate is $r = k/n$. By setting
$k = \lceil \log_2 M \rceil$ we can perform output encoding on $S$ by means of a transformation
$\tau : Y \to \Theta$, which uses any subset of $M$ different codewords from $\Theta$. As a result, an
ECOC learning machine can be constructed. Such a learning machine is defined by
binary learners $L_i$ trained with binary samples $S_i$, $0 \le i \le n-1$. The set of $S_i$,
$0 \le i \le n-1$, is obtained by output encoding $S$ with $\Theta$. The expectation is that the
greater the redundancy $m = n - k$ in $\Theta$, the stronger the protection against errors from binary
learners $L_i$, $0 \le i \le n-1$. Let us assume binary learners' errors satisfy a symmetric,
additive, discrete, memoryless channel. Therefore, we admit that the binary
hypothesis $h_i$ issued by learner $L_i$ upon seeing the input vector $x \in X$ can be modelled as
follows:

$h_i = c_i + e_i, \quad 0 \le i \le n-1$   (1)

¹ Please note that the $(n, j, v)$ LDPC code notation characterizing a particular parity check
matrix is different from the $(n, k, d)$ notation characterizing general linear block codes.

Assuming modulo 2 arithmetic, we are expressing that the true binary hypothesis $c_i$
is hidden by an additive residual (noise) hypothesis $e_i$. For the sake of clarity, we
have omitted the hypotheses' dependence on the input vector $x \in X$. The probabilities of
binary learners' errors, $p_i = P(e_i = 1) \le 0.5$, $0 \le i \le n-1$, characterizing the learning
channel can be estimated from training errors. The assembling of RECOC hypotheses,
i.e. multiclassification, assumes the computation of symbol (bitwise) Maximum A
Posteriori (MAP) estimates $c_i^*$ from the vector $h$ of binary noisy hypotheses, the
estimated vector $p$ of binary residual hypotheses and the coding structure $\Theta$:

$c_i^* = \arg\max_{c_i \in \{0,1\}} P[c_i \mid h, p, \Theta], \quad i = 0, \ldots, n-1$   (2)
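
For a very small code, the bitwise MAP estimates of equation (2) can be computed exactly by enumerating codewords. The sketch below does this under the memoryless channel model of equation (1); it is a brute-force stand-in, not the iterative Sum-Product decoder, and the toy code, the uniform prior over codewords and the function name are assumptions made for illustration.

```python
def bitwise_map(h, p, codewords):
    """Exact symbol-wise MAP estimates by enumeration over a small code.

    h         : observed binary hypotheses, one per binary learner
    p         : error probability of each binary learner
    codewords : list of valid codewords (uniform prior assumed)
    """
    likelihoods = []
    for c in codewords:
        like = 1.0
        for h_i, c_i, p_i in zip(h, c, p):
            # An error e_i = 1 flips the bit, with probability p_i.
            like *= p_i if h_i != c_i else (1.0 - p_i)
        likelihoods.append(like)
    total = sum(likelihoods)
    estimate = []
    for i in range(len(h)):
        p_one = sum(l for l, c in zip(likelihoods, codewords) if c[i] == 1) / total
        estimate.append(1 if p_one > 0.5 else 0)
    return estimate

# Toy linear code with k = 2 information bits and n = 5 code bits.
code = [(0, 0, 0, 0, 0), (1, 1, 1, 0, 0), (0, 0, 1, 1, 1), (1, 1, 0, 1, 1)]
# Codeword (1, 1, 1, 0, 0) observed with an error in the last bit.
c_star = bitwise_map(h=(1, 1, 1, 0, 1), p=[0.1] * 5, codewords=code)
```

Enumeration is only feasible for tiny codes; note also that the vector of bitwise estimates is not guaranteed to be a valid codeword in general.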

For LDPC codes the iterative SP decoding algorithm does this work. The process
involves a fixed number of iterations $I$. The concluding non-binary hypothesis $h_f(x)$
is reconstructed by means of $\tau^{-1}(c^*)$, where $c^* = [c_i^*]$. RECOC LDPC learning
admits a further parameter concerning multiclass learning improvement by binary
transduction. Binary learners can be boosted versions of some fixed weak binary learner
[12]. The aim is that binary boosting can induce a better learning channel. We are
aware that constructive boosting effects are limited to the point where overfitting, and
thus dependence between binary learners, starts. On the other hand, it is well known
that boosting strong binary learners produces the opposite effect: learners become
weaker. Taking into account that suitably weakened strong learners might provide
independence between errors and a roughly noise-free learning channel, we explored
RECOC LDPC learning with boosted Decision Trees. Boosting binary learners before
output LDPC coding is comparable to a concatenated coding scheme with an inner
repetition code [13]. A concatenated coding scheme provides further protection
against noise. Experimental results confirm this interpretation. In addition,
considering that for a fixed boosting level multiclass learning is improved using iterative
decoding, it follows that parameter $I$ defines a new boosting dimension, the one
provided by suitably selected coding spaces.

3 The Setting of Parameters 

The purpose of this section is to explain the setting of parameters in RECOC LDPC
algorithms. Let us start with the choice of the transformation $\tau : Y \to \Theta$. There are many such
mappings and we do not find significant improvements when varying them. Setting
$k = \lceil \log_2 M \rceil$ might lead to unexpected results when $M < 2^k$. The iterative decoding
algorithm might supply a codeword disregarded by $\tau : Y \to \Theta$. In these cases, we find
significant improvements by a further stage of Minimum Hamming Distance
decoding on the wrongly estimated codeword. The choice of the channel rate $r = k/n$ is
closely related to the behaviour of LDPC codes over the assumed channel. Higher
channel rates impose stronger constraints on the quality of the (learning) channel. Lower
ones increase the probability of generating repeated columns in ECOC matrices, a
fact that might explain the unexpected² learning degradation at lower channel rates, i.e.
with greater redundancy. In our experimental work, we set $r = 0.1$ in order to allow
comparison with recent work in ECOC multiclassification.
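
The extra Minimum Hamming Distance stage can be sketched as follows: when the decoded word is not one of the $M$ codewords assigned to class labels, the label of the nearest assigned codeword is returned. The function names and the toy code with $M = 3$ labels and $k = 2$ are illustrative assumptions.

```python
import math

def hamming(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def decode_label(estimate, label_to_codeword):
    """Invert the label-to-codeword mapping, with a nearest-codeword fallback."""
    inverse = {cw: label for label, cw in label_to_codeword.items()}
    if estimate in inverse:
        return inverse[estimate]
    # Fallback: the decoder returned a codeword that no label uses.
    return min(label_to_codeword,
               key=lambda label: hamming(label_to_codeword[label], estimate))

M = 3
k = math.ceil(math.log2(M))  # k = 2, so the code has 2**k = 4 codewords
# Three of the four codewords of a toy linear code are assigned to labels;
# the remaining codeword (1, 1, 0, 1, 1, 1) is left unused.
mapping = {1: (0, 0, 0, 0, 0, 0), 2: (1, 1, 1, 0, 0, 0), 3: (0, 0, 1, 1, 1, 1)}
label_exact = decode_label((1, 1, 1, 0, 0, 0), mapping)
label_fallback = decode_label((1, 1, 0, 1, 1, 1), mapping)
```

Here the unused codeword lies at distance 3 from the codeword of label 3 and farther from the other two, so the fallback returns label 3.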



Recoc LDPC Algorithm
Input
    Sample $S$ with $|S| = m_S$, $Y = \{1, \ldots, M\}$, $k = \lceil \log_2 M \rceil$
    $\Theta$: LDPC code at channel rate $r = k/n$
    $\tau : Y \to \Theta$, Binary Learner $L$
    Iterations $I$ for the iterative SP algorithm
Processing
    Compute training samples $S_i$, $i = 0, \ldots, n-1$, using $\tau$ and $S$
    Train $L$ with samples $S_i$ to obtain $L = [L_i]$, $i = 0, \ldots, n-1$
    Compute $p$ from $L$ and $S_i$, $i = 0, \ldots, n-1$
Output
    $h_f(x) = \tau^{-1}[SP(x, L, p, I)]$
End Recoc LDPC



¹ In [4], sparse codes working at rate $r = 0.0625$ sometimes behave worse than the dense codes
working at rate $r = 0.1$. The greater the redundancy, the lower the protection?

Picking a good LDPC code in the learning sense is the main issue in the
implementation of RECOC LDPC learning. Let us consider $k = 5$, which covers
domains with $M \in [17, 32]$. Choosing $r = 0.1$ yields $n = 50$ and $m = 45$. LDPC parity
check matrices $H_{45 \times 50}$ can be constructed randomly assuming $j = 3$ ones per column
and minimum Hamming distance $h = 2$ between them. From $H$, the generator
matrix $G$ must be obtained. This step requires a matrix inversion operation over a
square submatrix in $H$. Random construction of LDPC codes does not
guarantee that such an inverse exists. However, we can still modify $H$ by inserting ones
to allow the inversion. Using $G$ yields the desired ECOC matrix. We only
require the encoding of source vectors $s$ representing multiclass labels.

A challenging coding-learning problem now arises: how can we characterize those
LDPC codes leading to good ECOC matrices? We find that scoring ECOC matrices
by the mean and the standard deviation of the Hamming distance between their columns can
be used to pick some of them. Specifically, we simulated a sufficiently large set of
LDPC codes and computed the average Column Hamming Distance (CHD) and its standard
deviation in the resulting ECOC matrices. Intuitively, we must look for LDPC
codes generating ECOC matrices with high average CHD and small standard deviation.
Results for the case $M = 26$ are shown in Fig. 1.
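The CHD scoring described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `chd_score` is hypothetical, the ECOC matrix is taken as a list of rows (one row per class, one column per binary learner), and the population standard deviation is used because the text does not say which estimator was adopted.

```python
from itertools import combinations
from statistics import fmean, pstdev


def chd_score(ecoc_rows):
    """Return (mean, std) of the pairwise Hamming distances between columns.

    ecoc_rows: list of equal-length rows of 0/1 entries (classes x learners).
    Assumption: population standard deviation over all column pairs.
    """
    columns = list(zip(*ecoc_rows))  # transpose: one tuple per column
    distances = [
        sum(a != b for a, b in zip(col_a, col_b))
        for col_a, col_b in combinations(columns, 2)
    ]
    return fmean(distances), pstdev(distances)


# Toy 4-class matrix with 3 binary learners (hypothetical data).
toy = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
mean_chd, std_chd = chd_score(toy)
print(mean_chd, std_chd)
```

A repeated column shows up as a pairwise distance of zero, which lowers the mean and raises the deviation, in line with the selection criterion (high average CHD, small standard deviation).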




[Figure 1, panel title: "Deviations from the Average CHD in RECOC Matrices with M=26"]

Fig. 1. Searching for good RECOC codes among LDPC ones. Top and bottom panels show
histograms of the average and standard deviation of CHD in RECOC matrices. The middle
panel shows the relation between both.



Fig. 1 suggests that threshold values of 0.35 for the maximum standard
deviation and 12.65 for the minimum average CHD might provide good results, as indeed
happened. An inspection of the whole set of ECOC matrices showed that those with
too many repeated columns produced higher values of standard deviation. In fact, a
limited number of repeated ECOC columns is allowed in our framework, as well as null
ones. We expect the undesirable effects of correlated (equal!) binary learners to be
diluted in the divide and conquer approach. On the other hand, introducing
a small amount of labelling noise solved the null column problem.

4 Experimental Results

Learning algorithms were developed using the public domain Java WEKA library [14].
For LDPC coding, we developed a WEKA extension based on the public domain software
by D. J. MacKay [9]. The experimental version is available upon request from the
authors. We evaluated RECOC LDPC performance on 15 UCI data sets. For the sake of
brevity we present a subset comprising only 9 data sets (see Table 1). Evaluation was
done with the test error when a partition was available. Otherwise, we used 10-fold
cross validation. Both strong (Decision Trees) and weak (Decision Stumps) learners were
studied. We recall here that Decision Stumps are basically one-level Decision
Trees. Experimental results showed that about ten boosting steps were
enough for observing iterative learning improvements with Decision Trees. Convincing
convergence results appeared after about a hundred iterative decoding steps. We
considered LDPC codes at $r = 0.1$, $j = 3$, $h = 2$. We first simulated 10000 LDPC
codes with their corresponding ECOC matrices. These matrices were then ranked by
their average CHDs. We then observed the top five matrices and picked by inspection
the one exhibiting a good compromise between maximum average CHD and minimum
variance. Regarding questions of stability in the SP algorithm, a threshold value
of 1e-4 was assumed for all binary training errors. Results with Decision Tree learners
are depicted in Fig. 2 to Fig. 4, with individual runs manually clustered, and the
boosting level shown in the upper right. Representative statistics are shown in Table
2. For each learning domain, we selected the run with the $T$ value yielding the
minimum test or 10-fold CV error along the iterative decoding process. Alternatively,
we might fix a number of decoding steps and select $T$ by cross-validation. Parameter
selection issues will be a matter of future research. For a boosting level $T$, a channel
rate $r$ and a number of classes $M$, the number of classifiers involved in a RECOC
learning machine turns out to be $T \times \lceil \log_2 M \rceil \times r^{-1}$. Representative results with Decision
Stump learners are included in Table 2.
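As a quick worked example of the classifier count given above, the following sketch evaluates $T \times \lceil \log_2 M \rceil \times r^{-1}$. The function name is hypothetical, and rounding $k/r$ to the nearest integer code length is an assumption made here to absorb floating-point error; the text does not discuss rates for which $k/r$ is not an integer.

```python
import math


def recoc_classifier_count(boosting_level, num_classes, rate):
    """Number of binary classifiers: T * ceil(log2 M) / r.

    Assumption: the code length n = k / r is rounded to the nearest integer.
    """
    k = math.ceil(math.log2(num_classes))  # source bits per class label
    n = round(k / rate)                    # binary learners per ECOC matrix
    return boosting_level * n


# Letter data set: M = 26 classes, r = 0.1, boosting level T = 15.
print(recoc_classifier_count(15, 26, 0.1))  # 15 * 5 / 0.1 = 750
```

With $M = 26$ the label needs $k = 5$ bits, so each ECOC matrix has $n = 50$ columns, matching the $n = 50$ quoted earlier for $k = 5$ and $r = 0.1$.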



Table 1. Learning Domains - UCI Data Sets.

                           #Examples
Domain                  Train     Test   #Attributes   #Classes
Dermatology (DE)          366        -            34          6
Satimage (SA)            4435     2000            36          6
Glass (GL)                214        -             9          7
Pendigits (PE)           7494     3498            16         10
Vowel (VO)                528      462            10         11
Soybean (SO)              307      376            35         19
Primary Tumor (PT)        339        -            17         21
Audiology (AU)            226        -            69         24
Letter (LE)             16000     4000            16         26


Results shown in Fig. 2 to Fig. 4 and Table 2 confirm our claim of learning-coding
consistency. RECOC codes are not constrained by naive dense or sparse combinatorial
constraints or by artefacts like the use of pseudo ternary codes. Rather, they are
supported by a powerful coding structure.

[Figure 2, panel titles: "RECOC LDPC Rate = 0.1 - Dermatology Data Set", "RECOC LDPC Rate = 0.1 - Satimage Data Set", "RECOC LDPC Rate = 0.1 - Glass Data Set"; x-axis: Decoding Iterations]

Fig. 2. RECOC LDPC $r = 0.1$ with Adaboost Decision Tree binary learners on Dermatology,
Satimage and Glass data sets. The associated boosting level $T$ is shown in the upper right.



[Figure 3, panel title: "RECOC LDPC Rate = 0.1 - Pendigits Data Set"; x-axis: Decoding Iterations]

Fig. 3. RECOC LDPC $r = 0.1$ with Adaboost Decision Tree binary learners on Pendigits, Vowel
and Audiology data sets. The associated boosting level $T$ is shown in the upper right.

[Figure 4, panel title: "RECOC LDPC Rate = 0.1 - Letter Data Set"]

Fig. 4. RECOC LDPC $r = 0.1$ with Adaboost Decision Tree binary learners on Letter data set.
The associated boosting level $T$ is shown in the upper right.



Table 2. Error % achieved by RECOC LDPC $r = 0.1$, $j = 3$ using Decision Trees (DT) with
boosting steps $T \in [1, 15]$ and Decision Stumps (DS) with boosting steps $T \in [1, 100]$. Mean*
and Std* are computed for $I \in [75, 100]$, Min and Max over $I \in [1, 100]$.

Boosted DT
                          Statistics                          % Error
Domain     T     Min     Max   Mean*    Std*     I=1    I=40    I=60    I=80   I=100
DE        15    1.91    3.83    2.22  0.1641    1.91    2.19    2.19    2.19    2.19
SA         8    9.30   12.25   10.10  0.3138    11.4    10.2    10.4    9.60    10.0
GL        14   21.96   28.04   24.37  0.9044    27.5    24.3    24.3    24.8    26.7
PE        14    2.92    3.80    3.07  0.0799    3.26    3.15    3.20    3.03    3.09
VO        11    48.2    56.1   49.81  0.8610    50.2    48.7    51.3    48.5    48.3
AU        13    14.6    21.7   16.47  0.5251    19.5    17.2    17.2    16.3    16.3
LE        15    4.85    7.62    5.10  0.0940    7.62    5.35    5.15    5.17    5.00

Boosted DS
                          Statistics                          % Error
Domain     T     Min     Max   Mean*    Std*     I=1    I=40    I=60    I=80   I=100
PT        71    52.5    61.1   54.05  0.5860    61.1    55.7    53.1    54.5    54.2
DE        15    1.91    4.37    2.35  0.2060    4.37    2.19    2.19    2.73    2.19
SO       100    6.65    12.5    10.4  0.6486    12.5    10.9    10.6    10.9    9.57



Multiclass learning is improved when boosting binary learners. This effect is
observed regardless of whether the improvement is due to the leveraging of weak learners
or to improved independence in strong learners.

We remark that improvements persist even when binary and multiclass training
errors attain perfect classification. The results obtained improve on many of those shown in
[4], and are close to those in [15]. In addition, they are consistent with the thesis of
Rifkin [16]. We note in passing that in this study we have constrained ourselves to
$r = 0.1$. Higher channel rates might yield better results, considering that the
probability of repeated columns might be reduced as well. We conjecture that the good
performance of the simple One Against All strategy in certain learning domains is due to
the effective ECOC structure with a set of orthogonal columns.

5 Conclusions and Further Work

We have experimentally shown that RECOC learning is a coding-theory-compliant
ECOC learning framework. Knowledge is acquired in a distributed and iterative way,
relying on a well-bounded set of ECOC codes. A methodology for finding such sets
has been presented. However, we feel that we have only scratched the surface of the problem.
Clearly, the design of suitable scoring systems for qualifying ECOC matrices constructed from
LDPC codes is an interesting line of research. In addition, an extensive
study of the behaviour of RECOC LDPC learning over a broad spectrum of channel
rates remains to be done.

Abstract. A formula is derived for the exact computation of Bagging
classifiers when the base model adopted is k-Nearest Neighbour (k-NN).
The formula, which holds in any dimension and does not require the
extraction of bootstrap replicates, proves that Bagging cannot improve
1-Nearest Neighbour. It also proves that, for k > 1, Bagging has a smoothing
effect on k-NN. Convergence of empirically bagged k-NN predictors
to the exact formula is also considered. Efficient approximations to the
exact formula are derived, and their applicability to practical cases is
illustrated.



1 Introduction 

This note is devoted to the derivation of a formula for the exact computation
of the Bagging model [1], when the base predictor adopted is the k-Nearest
Neighbour (k-NN) classifier [2].

The Bagging method essentially consists in averaging over a(n infinite) series
of realizations of the same base predictor - in our case, k-NN. Virtues and
limitations of methods based on combination of predictors ("ensemble" methods),
such as Bagging, Boosting [3] and, specifically, the Adaboost algorithm [4], have
been extensively investigated - primarily, from the point of view of prediction
performance [5,6].

k-Nearest Neighbour is one of the most basic (and oldest) classifiers; it is
therefore interesting to study how it combines with Bagging. A first question
is whether Bagging can improve k-NN at all. The formula we derive proves
that 1-NN is perfectly equivalent to the bagged 1-NN, while, for k > 1, we
are able to show that Bagging has a smoothing ("regularizing") effect on k-NN.
This result adds theoretical grounding to a long recognized fact, i.e., that
Bagging is a variance reducing procedure that works especially well with high
variance (unstable) base models [7, 1]. While consistency of Adaboost is still
under scrutiny [8], the formula also implies that the limit exists for empirically
bagged k-NN predictors. Convergence of the latter to the exact formula can
thus be gauged under a variety of conditions. The formula is straightforward to
implement. Efficient approximations to it can also be derived, which we propose
for applicability in practical cases.

In Sec. 2, notations are introduced and background material on k-NN and
Bagging is reviewed. Main results are contained in Sec. 3, where a detailed
derivation of the exact and approximated formulae for the bagged k-NN predictor is
provided. Because of space limitations, some proofs are omitted and can be found
in [9]¹. In Sec. 4, application of main results is illustrated in concrete cases.

2 Background 

Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be a data set, where the data points $x_i$'s belong to a given
region, $A$, of some metric space $(X, d)$, and let the $y_i$'s be class labels. In the
following, we restrict ourselves to the problem of binary classification, and we
shall therefore assume that $y_i \in \{-1, 1\}$, $i = 1, \ldots, N$.

In this paper, a predictor is a function returning a class label $y$ for any
input pattern $x \in A$, that is synthesized from a given set of examples (i.e., a
data set $D$), through some specific (learning) process. So, a predictor depends
on the data set on which it is trained, and we may find it convenient to convey
this dependence by adopting the explicit notation: $\phi(\cdot\,; D)$, $\phi : A \to \{-1, 1\}$.

For the sake of notational simplicity, we will occasionally employ the same
letter, $D$, also to denote the set of training points, $x_i$'s, $D = \{x_1, \ldots, x_N\}$.

Nearest Neighbour is a predictor over region $A$ that assigns to point $x \in A$
the label associated with the data point $x_{i^*} \in D$ which is closest (with respect
to the metric of $A$) to $x$. In the same spirit, a k-NN classifier model is obtained
by considering the first $k$ data points nearest to point $x$, and classifying point $x$
as belonging to class 1 or −1 according to a majority voting criterion (for which
it is customary to take $k$ odd) [2].

Bagging is an ensemble method for building classifier predictors, which combines
those resulting from training the same base predictor over a suitable collection
of training sets, $D_b$. More specifically, let $D_b$, $b = 1, \ldots, B$, be a collection of data sets,
each lying in $A \times \{-1, 1\}$. An aggregated predictor $\phi_A(\cdot)$ is obtained by training
the same base model over the $D_b$'s and combining the different predictors
obtained through a majority voting criterion. Let $x \in A$, and let $B_1$ indicate
the number of $D_b$'s for which $\phi(x; D_b) = 1$; then $\phi_A(x) = 1$ if $B_1 > B/2$, and
$\phi_A(x) = -1$ otherwise.

Usually, however, only a single training set $D$ is available. The Bagging strategy
[1] consists in building a collection $D_b$ by Bootstrap [10] resampling: $B$ replicated
data sets $D_b$, $b = 1, \ldots, B$, are obtained by randomly sampling with replacement
exactly $N$ cases from the original data set $D$. The resulting predictor, which
we denote by $\phi_B(\cdot)$, is an approximation of the aggregated predictor $\phi_A(\cdot)$.

As is well known, the probability, $\pi(N)$, that any given sample is contained
in a bootstrap replicate of a given data set of cardinality $N$ is given by
$\pi(N) = 1 - (1 - \frac{1}{N})^N$. As $N$ increases, $\pi(N)$ rapidly converges to a value slightly greater
than 0.632. For the sake of notational simplicity, in the following we drop the
explicit dependence of $\pi$ on $N$.
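The convergence of the inclusion probability can be checked numerically. The sketch below simply evaluates $\pi(N) = 1 - (1 - 1/N)^N$ for a few values of $N$ and compares it with the limit $1 - 1/e$; the function name is an illustrative choice, not part of the paper.

```python
import math


def inclusion_probability(n):
    """Probability that a given point appears in a bootstrap replicate of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


limit = 1.0 - 1.0 / math.e  # value approached as n grows
for n in (5, 20, 50, 1000):
    # pi(n) stays above the limit and decreases towards it
    print(n, round(inclusion_probability(n), 5), round(limit, 5))
```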

3 Main Results 

Here we derive the exact formula for the computation of the bagging predictor
when the base predictor adopted is k-NN, for any $k$. In particular, for any input
pattern $x \in A$ we will compute $P(\phi_B(x) = 1)$, the probability that the Bagging
predictor maps $x$ to class 1. We start with the $k = 1$ case, and then extend the
argument to $k > 1$.

In all that follows, the training set $D$ is intended to be re-sorted in ascending
order of distance from $x$: $D = \{x_1, x_2, \ldots, x_N\}$, where $d(x_1, x) \le d(x_2, x) \le
\cdots \le d(x_N, x)$.

¹ Available online at the following address: http://mpa.itc.it/mcs2004/bagging-knn.ps



3.1 The 1-NN Case 

In the case $k = 1$ we will compute the probability with respect to the class $y_1$
of the first nearest neighbor $x_1$, i.e., we will compute $P(\phi_B(x) = y_1)$. We start
by computing the probability $P(\phi(x; D_b) = y_1)$, where $D_b$ is a generic bootstrap
replicated data set of $D$. The following result holds:

Proposition 1. Let $x \in A$ be any given input pattern and let $D_b$ be a generic
bootstrap replicated data set. The probability of mapping $x$ to the class $y_1$ of
its first nearest neighbor by means of a 1-NN predictor is at least $\pi$, i.e.,
$P(\phi(x; D_b) = y_1) \ge \pi$.

Proof. The probability that a given data point $x_i \in D$ determines the class to
be assigned to $x$ is $p_i = \pi(1 - \pi)^{i-1}$, i.e., the probability, $\pi$, that $x_i$ is extracted
in the replicated data set $D_b$, multiplied by the probability, $(1 - \pi)^{i-1}$, that no
points $\{x_1, \ldots, x_{i-1}\}$ are extracted that are closer to $x$ than $x_i$.

The probability of mapping $x$ to class $y_1$ is therefore given by the sum of the
probabilities $p_i$ over the points of class $y_1$. Let $I = \{i \in \mathbb{N} \mid y_i = y_1,\ 1 \le i \le N\}$
be the set of indices of the points of class $y_1$, then

$$P(\phi(x; D_b) = y_1) = \pi \sum_{i \in I} (1 - \pi)^{i-1}. \tag{1}$$

By isolating the contribution of the first nearest neighbor $x_1$, Eq. (1) can be
rewritten as

$$P(\phi(x; D_b) = y_1) = \pi + \pi \sum_{i \in I,\ i > 1} (1 - \pi)^{i-1} \ge \pi. \qquad \Box$$

We now compute the probability that the bagging predictor agrees at point $x$
with the prediction of the 1-NN.

Proposition 2. Let $x \in A$ be any given input pattern. The probability of mapping
$x$ to the class $y_1$ of its first nearest neighbor by means of a bagging 1-NN
predictor satisfies the following relationship:

$$\lim_{B \to \infty} P(\phi_B(x) = y_1) = 1.$$

Proof. Let us suppose that a collection of $B$ different versions of the base predictor
obtained by learning over Bootstrap-replicated data sets is available. The
evaluation of predictors at $x$, $\{\phi(x; D_b)\}_{b=1,\ldots,B}$, is equivalent to random labels
generation (either −1 or 1) from a binomial distribution $b(n, p)$. The parameters
of the distribution are $n = B$, the number of replicated data sets, and the
probability $p$ given in Eq. (1) (considering that the outcome $y_1$ corresponds to
"success").

By the law of large numbers we know that the proportion of successes (outcomes
$y_1$) converges to $p$ in probability as $B$ grows to infinity. Combined with
Prop. 1, this ensures that the proportion of outcomes $y_1$ is at least $\pi$ if $B$ is
sufficiently large. Since $\pi > 1/2$, the majority vote then returns $y_1$. As a
consequence: $\lim_{B \to \infty} P(\phi_B(x) = y_1) = 1$. $\Box$

The interpretation is immediate: 1-NN and (infinitely) bagged 1-NN coincide.
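Eq. (1) and the bound of Prop. 1 are easy to evaluate for a concrete label sequence. The following is a minimal sketch, assuming the labels are already sorted by increasing distance from the test point (so `labels[0]` is $y_1$); the function names are hypothetical.

```python
def inclusion_probability(n):
    """pi(N) = 1 - (1 - 1/N)^N."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def prob_vote_equals_first(labels):
    """Eq. (1): pi * sum over points of class y_1 of (1 - pi)^(i - 1).

    labels: class labels sorted by distance from x; labels[0] is y_1.
    The 0-based position i used below equals the exponent (rank - 1).
    """
    pi = inclusion_probability(len(labels))
    y1 = labels[0]
    return pi * sum((1.0 - pi) ** i for i, y in enumerate(labels) if y == y1)


labels = [1, -1, -1, 1, -1]  # hypothetical neighbour labels
p = prob_vote_equals_first(labels)
print(p, inclusion_probability(len(labels)))  # p is never below pi
```

Because the first neighbour alone already contributes $\pi > 1/2$, a single bootstrap 1-NN agrees with plain 1-NN more often than not, which is what drives the limit in Prop. 2.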



3.2 The k-NN Case

We now extend the result of the previous section to the k-NN case, for $k > 1$. The
overall strategy is the same: first we compute the probability $P(\phi(x; D_b) = 1)$,
and then we exploit the law of large numbers. The derivation of the formula for
$P(\phi(x; D_b) = 1)$ follows the same ideas as above, only we have to consider the
probability of extraction of subsets of data points, rather than single points. To
this end, the class $\alpha_r$ of subsets of $D$ is first introduced.

Definition 1. The class $\alpha_r$ is the set of all the replicated data sets containing
exactly $k - 1$ data closer to $x$ than $x_r$, and containing $x_r$ as the $k$-th closest point
to $x$:

$$\alpha_r = \{D_b \subseteq D \mid D_b = \{x_{i_1}, \ldots, x_{i_{k-1}}, x_r, \ldots\},\ i_j < r,\ 1 \le j \le k - 1\}.$$

Remark 1. Let $\mathcal{P}(D)$ be the power set of $D$ and let $\mathcal{P}_k(D) \subset \mathcal{P}(D)$ be the set
of all subsets of $D$ containing at least $k$ elements: $\{\alpha_r\}_{r=k,\ldots,N}$ is a partition, i.e.,
a disjoint covering, of $\mathcal{P}_k(D)$. This means that $\bigcup_{r=k}^{N} \alpha_r = \mathcal{P}_k(D)$ and $\alpha_r \cap \alpha_s = \emptyset$
for $r \ne s$.

Proposition 3. Let $x \in A$ be a given input pattern and let $D_b$ be a generic
bootstrap replicated data set. The probability of mapping $x$ to class 1 by means
of a k-NN predictor is

$$P(\phi(x; D_b) = 1) = \sum_{r=k}^{N} P(\phi(x; D_b) = 1 \mid D_b \in \alpha_r)\, P(D_b \in \alpha_r). \tag{2}$$

Proof. For fixed $r$, let $P(D_b \in \alpha_r,\ \phi(x; D_b) = 1)$ be the joint probability of
extracting a bootstrap data set belonging to $\alpha_r$ leading to majority vote 1.
Since the classes $\alpha_r$ are a partition of $\mathcal{P}_k(D)$, the probability that a generic
bootstrap replicated data set $D_b$ leads to majority vote equal to 1 is obtained
by summing $P(D_b \in \alpha_r,\ \phi(x; D_b) = 1)$ over $r$. Eq. (2) follows immediately from
the definition of conditional probability. $\Box$

Our aim is now to provide expressions for the two terms in the summation. To
this end, let us introduce the following

Definition 2. $S(r)$ is the number of elements of $\alpha_r$ leading to majority vote 1:

$$S(r) = \mathrm{Card}(\{D_b \in \alpha_r \mid \phi(x; D_b) = 1\}).$$

Remark 2. The probability that a bootstrap replicated data set belonging to $\alpha_r$
leads to majority vote 1 is

$$P(\phi(x; D_b) = 1 \mid D_b \in \alpha_r) = \frac{S(r)}{\mathrm{Card}(\alpha_r)} \tag{3}$$

and the number of elements belonging to class $\alpha_r$ is

$$\mathrm{Card}(\alpha_r) = 2^{N-r} \binom{r-1}{k-1}. \tag{4}$$

In fact, the binomial coefficient accounts for all the possible choices of $k - 1$
points from the first $r - 1$ ($x_r$ is fixed) and the term $2^{N-r}$ accounts for all the
possible combinations of the remaining $N - r$ points of $D$.
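The partition of Remark 1 and the cardinality of Eq. (4) can be verified by brute force on a small data set. The sketch below enumerates every subset of $\{1, \ldots, N\}$ with at least $k$ elements, assigns it to the class indexed by its $k$-th smallest element, and counts. Treating replicates as plain index subsets follows the definitions above; the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def alpha_class_sizes(n_points, k):
    """Count, for each r, the subsets whose k-th nearest point is x_r."""
    sizes = {r: 0 for r in range(k, n_points + 1)}
    indices = range(1, n_points + 1)  # 1-based ranks by distance from x
    for size in range(k, n_points + 1):
        for subset in combinations(indices, size):
            r = subset[k - 1]  # combinations() yields sorted tuples
            sizes[r] += 1
    return sizes


n_points, k = 6, 3
sizes = alpha_class_sizes(n_points, k)
for r, count in sizes.items():
    # Eq. (4): Card(alpha_r) = 2^(N - r) * C(r - 1, k - 1)
    print(r, count, 2 ** (n_points - r) * comb(r - 1, k - 1))
```

Each subset is counted in exactly one class, so the class sizes add up to the number of subsets with at least $k$ elements.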

As for the term $S(r)$, the following result holds (proof in [9]):

Proposition 4. The number of elements of $\alpha_r$ leading to majority vote 1 is
$S(r) = 2^{N-r}\,\delta(r)$, where $\delta(r)$ is a sum of products of binomial coefficients
in $n_1(r - 1)$ and $n_{-1}(r - 1)$, which indicate the number of points in
$D_{r-1} = \{x_1, \ldots, x_{r-1}\}$ labelled with 1 and −1, respectively.

For each $r$, computation of $S(r)$ thus involves summations of binomial
coefficients, which, as $N$ grows, makes Eq. (3) computationally impractical. In
[9] a more efficient, recursive way of computing $S(r)$ is presented.

We now turn to the derivation of the second probability term in the summation
of Eq. (2). To this end, let $D_b(r) = \{x_{i_1}, \ldots, x_{i_{k-1}}, x_r, \ldots\}$ be a given element
of $\alpha_r$, i.e., the indices $i_j$ are fixed; moreover, let $P(D_b(r))$ be its probability of
extraction.

Proposition 5. The probability $P(D_b \in \alpha_r)$ of extracting a generic bootstrap
replicated data set $D_b$ belonging to the class $\alpha_r$ is

$$P(D_b \in \alpha_r) = \binom{r-1}{k-1} P(D_b(r)).$$

Proof. Immediate, considering that the binomial coefficient accounts for all the
possible choices of $k - 1$ points from the first $r - 1$ and $x_r$ is fixed. $\Box$

Putting together Props. 4 and 5 and Eqs. (3) and (4), we can rewrite Eq. (2) as:

$$P(\phi(x; D_b) = 1) = \sum_{r=k}^{N} \delta(r)\, P(D_b(r)). \tag{5}$$

As for the term $P(D_b(r))$, the following result holds (proof in [9]):

Proposition 6. Let $\{u_1, \ldots, u_n, v_1, \ldots, v_m\} \subseteq D$ be a subset of the training
data. The probability of extracting a bootstrap replicated data set $D_b$ containing
the $u_j$'s and not containing the $v_j$'s is

$$P(u_1 \in D_b, \ldots, u_n \in D_b,\ v_1 \notin D_b, \ldots, v_m \notin D_b) = P_{n0} \prod_{i=0}^{m-1} (1 - p_i),$$

where $p_i = 1 - \left(1 - \frac{1}{N - i}\right)^{N}$ (according to this definition $p_0 = \pi$) and $P_{n0}$ can
be computed recursively as follows:

$$P_{1j} = p_{m+j}, \qquad j = 0, \ldots, n - 1,$$
$$P_{ij} = P_{i-1,\,j} - (1 - p_{m+j})\, P_{i-1,\,j+1}, \qquad i = 2, \ldots, n, \quad j = 0, \ldots, n - i.$$

Therefore, $P(D_b(r)) = P_{k0} \prod_{i=0}^{r-k-1} (1 - p_i)$, and Eq. (5) can be rewritten as

$$P(\phi(x; D_b) = 1) = P_{k0} \sum_{r=k}^{N} \delta(r) \prod_{i=0}^{r-k-1} (1 - p_i). \tag{6}$$
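The recursion of Prop. 6 can be sanity-checked against a direct inclusion-exclusion count. The sketch below is an illustration only: it assumes the recursion reads $P_{1j} = p_{m+j}$ and $P_{ij} = P_{i-1,j} - (1 - p_{m+j}) P_{i-1,j+1}$, and it compares the result with the probability, computed independently, that $N$ draws with replacement hit every $u_j$ and miss every $v_j$. All function names are hypothetical.

```python
from math import comb


def p(i, n_data):
    """p_i = 1 - (1 - 1/(N - i))^N; p_0 is the inclusion probability pi."""
    return 1.0 - (1.0 - 1.0 / (n_data - i)) ** n_data


def p_n0(n, m, n_data):
    """P_{n0}: probability that n given points are all drawn, given m excluded."""
    if n == 0:
        return 1.0
    row = [p(m + j, n_data) for j in range(n)]  # P_{1j}
    for i in range(2, n + 1):
        row = [row[j] - (1.0 - p(m + j, n_data)) * row[j + 1]
               for j in range(n - i + 1)]        # P_{ij}
    return row[0]


def prob_include_exclude(n, m, n_data):
    """Prop. 6: P_{n0} * prod_{i<m} (1 - p_i). Requires n + m <= N."""
    excluded = 1.0
    for i in range(m):
        excluded *= 1.0 - p(i, n_data)
    return p_n0(n, m, n_data) * excluded


def direct_count(n, m, n_data):
    """Same probability by inclusion-exclusion over the missing u_j's."""
    return sum((-1) ** j * comb(n, j) * ((n_data - m - j) / n_data) ** n_data
               for j in range(n + 1))


print(prob_include_exclude(3, 2, 10), direct_count(3, 2, 10))
```

The agreement of the two values is a check on this reading of the recursion, not a substitute for the proof in [9].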



Now we can finally state our main claim:

Proposition 7. Let $x \in A$ be any given input pattern. The probability of mapping
$x$ to class 1 by means of a bagging k-NN predictor satisfies the following
relationship:

$$\lim_{B \to \infty} P(\phi_B(x) = 1) = \begin{cases} 1 & \text{if } P(\phi(x; D_b) = 1) > \frac{1}{2} \\ \frac{1}{2} & \text{if } P(\phi(x; D_b) = 1) = \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$

Proof. The proof follows the same argument as in Prop. 2. $\Box$

Remark 3. It is interesting to notice that

$$\lim_{N \to \infty} P(D_b(r)) = \pi^{k} (1 - \pi)^{r-k}. \tag{7}$$

Besides providing a much simpler expression for $P(D_b(r))$, this fact has an
interesting interpretation: in the limit $N \to \infty$, the probability of extracting a
given replicated data set, $D_b(r)$, is equal to the probability, $\pi^k$, of extracting
$x_{i_1}, \ldots, x_{i_{k-1}}$ and $x_r$, multiplied by the probability $(1 - \pi)^{r-k}$ of not extracting
the remaining $(r - 1) - (k - 1)$ elements. This is the same as saying that, for
large $N$, extractions of samples from the data set can be regarded as independent
events.

Eq. (7) allows us to write an approximated formula for Eq. (6):

$$P(\phi(x; D_b) = 1) \approx \pi^{k} \sum_{r=k}^{N} \delta(r)\, (1 - \pi)^{r-k}. \tag{8}$$

In Sec. 4, applicability of Eq. (8) in place of the exact one is illustrated in
practical cases.
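A minimal sketch of the approximate formula of Eq. (8) follows. Two points are assumptions made here rather than statements of the paper: $\delta(r)$ is computed directly from Definition 2 as the number of ways of choosing $k - 1$ of the first $r - 1$ points so that, together with $x_r$, the majority of the $k$ labels is 1 (with $k$ odd); and $\pi$ defaults to $\pi(N)$. The optional `n_star` argument truncates the summation. All names are hypothetical.

```python
from math import comb


def delta(labels, r, k):
    """Ways to pick k-1 of the first r-1 points so the k-NN majority is +1.

    labels are sorted by distance from x; r is a 1-based rank; k is odd.
    """
    n_pos = sum(1 for y in labels[: r - 1] if y == 1)
    n_neg = (r - 1) - n_pos
    own = 1 if labels[r - 1] == 1 else 0      # vote of x_r itself
    total = 0
    for j in range(k):                        # j positives among the k-1 chosen
        if 2 * (j + own) > k:                 # strict majority of +1 votes
            total += comb(n_pos, j) * comb(n_neg, k - 1 - j)
    return total


def approx_prob_class1(labels, k, n_star=None, pi=None):
    """Eq. (8), optionally truncated at r = n_star."""
    n = len(labels)
    if pi is None:
        pi = 1.0 - (1.0 - 1.0 / n) ** n       # assumption: use pi(N)
    last = n if n_star is None else min(n, n_star)
    return pi ** k * sum(delta(labels, r, k) * (1.0 - pi) ** (r - k)
                         for r in range(k, last + 1))


labels = [1, -1, 1, 1, -1, -1, 1, -1, 1, -1]  # hypothetical neighbour labels
full = approx_prob_class1(labels, k=3)
truncated = approx_prob_class1(labels, k=3, n_star=6)
print(full, truncated)
```

For $k = 1$ this reduces to Eq. (1) evaluated for class 1, and the truncated value can only be smaller than the full one because every term of the sum is non-negative.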

Remark 4. The smoothing effect of Bagging becomes apparent as we rearrange the
terms of Eq. (6) as follows

$$P(\phi(x; D_b) = 1) = \underbrace{P_{k0}\,\delta(k)}_{k\ \text{terms}} + \underbrace{P_{k0}(1 - p_0)\,\delta(k + 1)}_{k+1\ \text{terms}} + \underbrace{P_{k0}(1 - p_0)(1 - p_1)\,\delta(k + 2)}_{k+2\ \text{terms}} + \ldots,$$

and observe that $P_{n0} \prod_{i=0}^{m-1} (1 - p_i)$ is a decreasing function of both $n$ and $m$. This
fact implies that the main contribution to $P(\phi(x; D_b) = 1)$ is given by the first
$k$ terms, but also all the other terms contribute, in a decreasing way, to it.

Remark 5. Formulas for $P(\phi(x; D_b) = 1)$ do not account for the bootstrap replicated
data sets containing fewer than $k$ elements. However, let $D_b = \{x_{i_1}, \ldots, x_{i_s}\}$,
with $s < k$. The probability of extracting any such data set is:

$$\sum_{r=1}^{k-1} \binom{N}{r} P_{r0} \prod_{i=0}^{N-r-1} (1 - p_i).$$

This means that, for $N$ large enough, it is a negligible probability.

4 Examples 

Two applications of formulas derived in Sec. 3 are now illustrated. 

4.1 Task 1 

The first task consists in comparing estimates of $P(\phi_B(x) = 1)$ as provided by
the exact formula (6), with those provided by its approximation (8) and by
empirically bagged k-NN predictors. Moreover, we will also consider truncated
versions of Eq. (8):

$$P(\phi(x; D_b) = 1) \approx \pi^{k} \sum_{r=k}^{N^*} \delta(r)\, (1 - \pi)^{r-k}, \quad \text{with } k \le N^* \le N.$$

Table 1. Misclassification rates of different k-NN predictors as compared to the exact
one of Eq. (6). Columns denoted with 'E' refer to empirically bagged predictors built
by aggregating 20, 50, 100, 200, 500 base models. 'Appr.' denotes the approximate
predictor of Eq. (8), while 'Tr.' denotes predictors derived from the approximate one
by truncating the summation at N* = 8, 12, 16, 20 terms.

k   N     E 20   E 50  E 100  E 200  E 500  Appr.   Tr. 8  Tr. 12  Tr. 16  Tr. 20
1  20      1.9    0.2    0.0    0.0    0.0    0.0     0.0     0.0     0.0     0.0
1  50      2.5    0.3    0.0    0.0    0.0    0.0     0.0     0.0     0.0     0.0
3  20      6.8    4.3    3.6    2.0    1.3    1.0     1.2     0.9     1.0     1.0
3  50      8.9    7.4    5.4    5.0    2.5    0.3     2.2     0.3     0.3     0.3
5  20      7.4    4.6    4.0    2.6    2.4    0.9    15.5     1.5     1.0     0.9
5  50      6.8    4.0    2.8    1.8    1.2    0.3    16.0     0.7     0.3     0.3
7  20      7.9    4.5    3.3    2.4    1.6    1.4    48.4    12.1     1.7     1.4
7  50      7.8    4.5    3.4    3.0    1.7    0.3    50.4    12.3     1.0     0.2

To this end, $N$ labels in $\{-1, 1\}$ are randomly generated so as to simulate the sorting
of data points in ascending distance from a "very difficult" test point (it is
in the proximity of the class boundary that labels of neighbours are mostly
affected by noise). For each such sequence of labels, a true class is computed as
the one returned by the exact formula. The true class is then compared with
those returned by the approximated formulas and by empirical bagging, as obtained by
aggregating an increasing number of k-NN predictors. Results reported in Tab. 1
are averaged over 1000 runs of the procedure.

The first thing to notice is that, for any reported $k$ and $N$, the disagreement
between empirical bagging and the exact formula decreases as the number of
aggregated models is increased. While the task is somewhat extreme, it is
interesting to see how, for $k > 1$, 500 aggregated models never suffice to reduce the
disagreement with the exact formula to within 1%. Secondly, outcomes of the
approximated formula are in excellent agreement with those of the exact formula
(as expected from Eq. (7), the agreement is remarkably better for $N = 50$).
Finally, the approximated formula can be truncated drastically (depending on
$k$), without much affecting the outcomes. Bold figures in Tab. 1 refer to the
minimum number of terms in the summation necessary to perform better than
the empirically bagged predictor built with 500 aggregated models.
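The empirical side of this comparison can be sketched as follows. The code draws bootstrap replicates of the ranks $0, \ldots, N-1$, applies k-NN to the distinct ranks drawn, and aggregates by majority vote. It is an illustrative reconstruction of the procedure, not the authors' code: the function names are hypothetical, and the handling of replicates with fewer than $k$ distinct points (vote over those available, ties resolved to −1) is an assumption.

```python
import random


def knn_vote(labels_sorted, sampled_ranks, k):
    """k-NN vote on one replicate; labels are sorted by distance from x."""
    nearest = sorted(set(sampled_ranks))[:k]   # k closest distinct points drawn
    return 1 if sum(labels_sorted[i] for i in nearest) > 0 else -1


def empirical_bagged_knn(labels_sorted, k, n_models, rng):
    """Return (predicted class, fraction of replicates voting +1)."""
    n = len(labels_sorted)
    votes_for_one = 0
    for _ in range(n_models):
        replicate = [rng.randrange(n) for _ in range(n)]  # N draws with replacement
        if knn_vote(labels_sorted, replicate, k) == 1:
            votes_for_one += 1
    fraction = votes_for_one / n_models
    return (1 if fraction > 0.5 else -1), fraction


rng = random.Random(0)
labels = [rng.choice((-1, 1)) for _ in range(20)]  # a "very difficult" test point
for n_models in (20, 50, 100, 200, 500):
    print(n_models, empirical_bagged_knn(labels, k=3, n_models=n_models, rng=rng))
```

The fraction of +1 votes is a Monte Carlo estimate of $P(\phi(x; D_b) = 1)$, so when that probability is close to 1/2 many replicates are needed before the aggregated class stabilises, consistent with the slow decrease of the 'E' columns in Tab. 1.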



4.2 Task 2 

The second task consists in comparing the classification performance of the exact,
approximate and empirically bagged predictors. The example reported refers to a
relatively simple, two-dimensional problem: 4 sets of points, 25 points each, were
generated by sampling 4 two-dimensional Gaussian distributions, respectively
centered in (−1, 1/2), (0, −1/2), (0, 1/2) and (1, −1/2). Covariance matrices were
diagonal for all the 4 distributions; variance was constant and equal to 2/5. Points
coming from the sampling of the first two Gaussians were labelled with class −1;
the others with class 1. Misclassification rates of predictors are measured against
the optimal Bayes predictor on 2500 test points lying on an evenly spaced grid.
Results are collected in Tab. 2.

Table 2. Misclassification rates of different k-NN predictors. Notations are the same
as in Tab. 1, except for the second column, which refers to the plain (unbagged) k-NN
predictor, and the column "Exact", which refers to the exact formula (6).

k   k-NN   E 20   E 50  E 100  E 200  E 500  Exact  Appr.  Tr. 8  Tr. 12  Tr. 16  Tr. 20
1   13.2   13.0  13.28   13.2   13.2   13.2   13.2   13.2   13.2    13.2    13.2    13.2
3  13.84  10.24   9.36   8.84   8.84   8.68   8.72   8.72   8.76    8.72    8.72    8.72
5    9.0   6.64   6.36   6.04    6.2   6.24   6.12   6.08    8.2    6.24    6.08    6.08
7   6.64   5.72   5.64   5.56   5.36    5.4   5.32   5.32  48.72     5.8    5.36    5.32

Fig. 1. Class boundaries (solid lines) obtained by applying a) plain 5-NN, b) bagged
5-NN predictors. The dashed lines correspond to the Bayes decision rule.

First we notice that, as expected, for $k = 1$ the plain k-NN predictor performs
as the exact bagging predictor (see Prop. 2). Second, for $k > 1$ bagged predictors
perform consistently better than plain k-NN. Thirdly, as the number of
aggregated models increases, the empirical bagging predictors converge rapidly
to the exact predictor. In fact, 100 aggregated models are sufficient to match
the performance of the exact formula. As for the approximated and
truncated predictors, the agreement with the exact formula is excellent, and
perfectly consistent with that observed in Task 1 above.

In Fig. 1 the class boundaries obtained by applying plain 5-NN and bagged
5-NN are shown: the misclassification error is about 9% for plain 5-NN and
6% for bagged 5-NN (see Tab. 2). The figure clearly shows the smoothing effect
obtained by applying bagging, which, in this noisy problem, improves
predictions. The same behaviour was also observed for $k$ equal to 3 and 7.

5 Conclusions 

Let us finally summarize the main points of this paper: 

a. for k-Nearest Neighbour predictors, bagged models can be computed in exact,
closed form;






b. the formula shows that: (1) Bagged 1-NN is perfectly equivalent to plain
1-NN; (2) for k > 1, Bagging has a smoothing effect on k-NN;

c. efficient approximations to the formula can be derived, which prove effective
in typical cases.

As for further developments, we believe it would be interesting to investigate
whether the method can be extended to predictors other than k-NN. To this
end, let us notice that the partition $\{\alpha_r\}_{r=k,\ldots,N}$ of the set $\mathcal{P}_k(D)$ does not
depend on the specific predictor $\phi$ adopted. The same consideration holds for
Prop. 3, which allows decomposing the probability $P(\phi(x; D_b) = 1)$ in terms of
$P(\phi(x; D_b) = 1 \mid D_b \in \alpha_r)$ and $P(D_b \in \alpha_r)$, as well as for Props. 5 and 6, which
allow computing $P(D_b \in \alpha_r)$. Thus, at least in principle, everything reduces to
estimating $P(\phi(x; D_b) = 1 \mid D_b \in \alpha_r)$ for the predictor of choice. A promising
next step may consist in deriving controlled approximations for kernel-based
predictors such as Radial Basis Functions or Support Vector Machines.

Abstract. In this paper, we present a maximum entropy (maxent) approach
to the problem of fusing experts' opinions, or classifiers' outputs.
The maxent approach is quite versatile and allows us to express in a
clear, rigorous way the a priori knowledge that is available on the problem.
For instance, our knowledge about the reliability of the experts and
the correlations between these experts can be easily integrated: each
piece of knowledge is expressed in the form of a linear constraint. An
iterative scaling algorithm is used in order to compute the maxent solution
of the problem. The maximum entropy method seeks the joint
probability density of a set of random variables that has maximum entropy
while satisfying the constraints. It is therefore the "most honest"
characterization of our knowledge given the available facts (constraints).
In the case of conflicting constraints, we propose to minimise the "lack
of constraints satisfaction" or to relax some constraints and recompute
the maximum entropy solution. The maxent fusion rule is illustrated by
some simulations.



1 Introduction 

The fusion of various sources of knowledge has been an active subject of research
for more than three decades (for some review references, see [2], [6], [8]). It
has recently been successfully applied to the problem of classifier combination
or fusion (see for instance [13]).

Many different approaches have been developed for the fusion of experts' opinions,
including weighted average (see for instance [2], [8]), Bayesian fusion (see for
instance [2], [8]), majority vote (see for instance [1], [12], [16]), models coming
from uncertainty reasoning such as fuzzy logic and possibility theory [14] (see for
instance [3]), standard multivariate statistical analysis techniques such as
correspondence analysis [18], etc. One of these approaches is based on maximum
entropy modeling (see [17], [19]). Maximum entropy is a versatile modeling
technique that makes it easy to integrate various constraints, such as the
correlation between experts, the reliability of these experts, etc.

In this work, we propose a new model for integrating experts' opinions, based
on a maximum entropy model (for a review of maximum entropy theory and
applications, see for instance [7], [9], [10] or [11]). In this paper, we use the term
“experts' opinions”, but it should be clear that exactly the same procedures
can be used for “classifier combination”. In other words, we could substitute
“classifiers” for “experts” everywhere.

Here is the rationale of the method. Each expert expresses his opinion about
the outcome of a random event, $y = i$, in the form of an a posteriori probability
density, called a score. These scores are subjective expectations about this event.
We also suppose that we have access to a reliability measure for each expert,
for instance in the form of a probability of success, as well as a measure of the
correlation between the experts. These measures are combined
by maximum entropy in order to obtain a joint probability density. Let us recall
that the maximum entropy density is the density that is “least informative” while
satisfying all the constraints; i.e. it does not introduce “extra ad hoc information”
that is not relevant to the problem. Once this joint density is found, we compute
the a posteriori probability of the event by averaging over all the possible situations
that can be encountered, i.e. by computing the marginal $P(y = i \mid \mathbf{x})$, where $\mathbf{x}$ is
the feature vector on which we base our prediction.

While the main idea is similar, our model differs from [19] in the formulation
of the problem (we focus on quantities that are relevant to classification
problems and can easily be computed for classifiers: success rate, degree of
agreement, etc.) and in the way the individual opinions are aggregated. Furthermore,
we also tackle the problem of incompatible constraints, that is, the case where there
is no feasible solution to the problem, a situation that is not mentioned by [19].

Section 2 introduces the problem and our notations. Section 3 develops the
maximum entropy solution. Section 4 presents some simulation results. Section
5 is the conclusion.

2 Statement of the Problem 

Suppose we observe the outcome of a set of events, $\mathbf{x}$, as well as a related event,
$y$, whose outcomes belong to the set $\{1, 2, \ldots, n\}$. We hope that the random
vector $\mathbf{x}$ provides some useful information that allows us to predict the outcome of
$y$ with a certain accuracy.

We also assume that domain experts ($m$ experts in total) have expressed their
opinion on the event $y$, based on the observation of $\mathbf{x}$. We denote by $d(k) = i$,
with $i \in \{1, 2, \ldots, n\}$ and $k = 1, \ldots, m$, the fact that expert $k$ chooses the
outcome or alternative $i$; in other words, he takes decision $i$. In this framework,
$P(d(k) = i \mid \mathbf{x})$ will be interpreted as the personal expectation of the expert,
i.e. the proportion of times a given expert $k$ would choose alternative $i$ when
observing $\mathbf{x}$ (for a general introduction to the concept of subjective probabilities,
see [16]).

Our objective is to seek the joint probability density of the event, $y = i$, as
well as the experts' opinions, $d(k) = i_k$:

$$P(y = i, d(1) = i_1, d(2) = i_2, \ldots, d(m) = i_m \mid \mathbf{x}) \tag{2.1}$$

This joint probability density will be estimated by using a maximum entropy
argument that will be presented in the next section.

Prior knowledge on the problem, including the experts' opinions, will be expressed
as linear constraints on this joint density (2.1). In our case, there will
be four different types of constraints (detailed in the following four subsections):
constraints ensuring that (2.1) is a probability density (it sums to one),
constraints related to the opinion of the experts, constraints related to the
reliability of the experts, and constraints related to the correlation between
experts.

2.1 Constraints Inducing a Probability Density 

The first constraint simply states that the joint density sums to one:

$$\sum_{i, i_1, \ldots, i_m = 1}^{n} P(y = i, d(1) = i_1, d(2) = i_2, \ldots, d(m) = i_m \mid \mathbf{x}) = 1 \tag{2.2}$$

This constraint will be called the sum (sum) constraint. Of course, we should
also impose that the joint density is always positive, but this is not necessary
(maximum entropy estimation leads to positive values, so that this constraint
will be automatically satisfied).

2.2 Constraints Related to the Opinions of the Experts 

Here, we provide information related to the experts' opinions. We will consider
that each expert expresses his opinion about the outcomes, according to the
observation $\mathbf{x}$:

$$P(d(k) = i_k \mid \mathbf{x}) = \pi(d(k) = i_k \mid \mathbf{x}) \quad \text{for } k = 1 \ldots m,\ i_k = 1 \ldots n \tag{2.3}$$

where $\pi(d(k) = i_k \mid \mathbf{x})$, the likelihood of choosing alternative $i_k$, is provided by
expert $k$ for each outcome $i_k \in \{1, 2, \ldots, n\}$. In other words, each expert provides
his likelihood of observing outcome $i_k$ according to his subjective judgement.
It indicates that, on average, expert $k$ would choose alternative $i_k$ with
probability $\pi(d(k) = i_k \mid \mathbf{x})$ when he observes evidence $\mathbf{x}$. Of course, in the case of
classifier combination, each $\pi(d(k) = i_k \mid \mathbf{x})$ would correspond to a classifier
output (assuming that each classifier provides as outputs a posteriori probabilities
of belonging to a given class). This constraint will be called the opinion (op)
constraint. Notice that (2.3) can be rewritten as

$$\sum_{i, i_1, \ldots, i_{k-1}, i_{k+1}, \ldots, i_m = 1}^{n} P(y = i, d(1) = i_1, \ldots, d(m) = i_m \mid \mathbf{x}) = \pi(d(k) = i_k \mid \mathbf{x}) \tag{2.4}$$

for $k = 1 \ldots m$ and $i_k = 1 \ldots n$. Or, equivalently,

$$\sum_{i, i_1, \ldots, i_m = 1}^{n} \delta(i_k - j_k)\, P(y = i, d(1) = i_1, \ldots, d(m) = i_m \mid \mathbf{x}) = \pi(d(k) = j_k \mid \mathbf{x}) \tag{2.5}$$

where $\delta$ is the Kronecker delta. This form (involving Kronecker deltas) is
better suited for maximum entropy computation (see for instance [9]).




Yet Another Method for Combining Classifiers Outputs 




2.3 Constraints Related to the Reliability of the Experts 

Some experts may be more reliable than others. We can express this fact by,
for instance, recording the success rate of each expert. This can be expressed
formally by

$$\sum_{i=1}^{n} P(y = i, d(k) = i \mid \mathbf{x}) = \Delta(k \mid \mathbf{x}) \quad \text{for } k = 1 \ldots m \tag{2.6}$$

$\Delta(k \mid \mathbf{x})$ can be interpreted as the success rate of expert $k$, i.e. the probability of
taking the correct decision (the probability that the opinion of the expert and
the outcome of the event agree) when observing $\mathbf{x}$. If $\Delta(k \mid \mathbf{x}) = 1$, expert $k$ is
totally reliable in the sense that the judgement of the expert and the outcome
of the experiment always agree. On the other hand, if $\Delta(k \mid \mathbf{x}) = 0$, expert $k$ is
always wrong (he always disagrees with the outcome of the experiment). Now,
if the reliability is only known without reference to the context $\mathbf{x}$ (or we do
not have access to this detailed information), we could simply state that it is
independent of $\mathbf{x}$ which, of course, is much more restrictive:

$$\sum_{i=1}^{n} P(y = i, d(k) = i \mid \mathbf{x}) = \Delta(k) \quad \text{for } k = 1 \ldots m \tag{2.7}$$

where we do not require knowledge of the probability of success for all
situations $\mathbf{x}$. This constraint will be called the reliability (rel) constraint. (2.7) can
be rewritten as

$$\sum_{i, i_1, \ldots, i_m = 1}^{n} \delta(i - i_k)\, P(y = i, d(1) = i_1, \ldots, d(m) = i_m \mid \mathbf{x}) = \Delta(k) \tag{2.8}$$

2.4 Constraints Related to the Correlations between Experts 

It is well known that experts' opinions can be correlated. A possible choice for
modeling the correlations between experts would be to provide

$$\sum_{i=1}^{n} P(d(k) = i, d(l) = i \mid \mathbf{x}) = \sigma(k, l \mid \mathbf{x}) \quad \text{for } k, l = 1 \ldots m \tag{2.9}$$

It corresponds to the probability that expert $k$ and expert $l$ agree. If
$\sigma(k, l \mid \mathbf{x}) = 1$, expert $k$ and expert $l$ always agree (they are totally correlated), while if
$\sigma(k, l \mid \mathbf{x}) = 0$, they always disagree. If we only know the correlation without
reference to the context $\mathbf{x}$, we must postulate independence with respect to the
context, i.e.

$$\sum_{i=1}^{n} P(d(k) = i, d(l) = i \mid \mathbf{x}) = \sigma(k, l) \quad \text{for } k, l = 1 \ldots m \tag{2.10}$$

This constraint will be called the correlation (cor) constraint. Once more, we
can rewrite (2.10) as

$$\sum_{i, i_1, \ldots, i_m = 1}^{n} \delta(i_k - i_l)\, P(y = i, d(1) = i_1, \ldots, d(m) = i_m \mid \mathbf{x}) = \sigma(k, l) \tag{2.11}$$
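
The Kronecker-delta forms (2.5), (2.8) and (2.11) can all be read the same way: each constraint fixes the expectation of a 0/1 indicator under the joint density. The following Python sketch illustrates this reading. The dictionary layout `joint[(y, d)]`, the function names and the 0-based indexing of outcomes and experts are illustrative assumptions, not notation from the paper, and the uniform joint is only a toy input.

```python
from itertools import product

def uniform_joint(n_outcomes, n_experts):
    """Toy joint density over (y, (d(1), ..., d(m))); uniform, for illustration only."""
    states = [(s[0], s[1:]) for s in product(range(n_outcomes), repeat=n_experts + 1)]
    return {state: 1.0 / len(states) for state in states}

def expectation(joint, indicator):
    """Left-hand side of a linear constraint: sum of indicator * probability."""
    return sum(p for (y, d), p in joint.items() if indicator(y, d))

def opinion_lhs(joint, k, j_k):
    """(2.5): indicator delta(i_k - j_k), i.e. expert k chooses outcome j_k."""
    return expectation(joint, lambda y, d: d[k] == j_k)

def reliability_lhs(joint, k):
    """(2.8): indicator delta(i - i_k), i.e. expert k agrees with the outcome."""
    return expectation(joint, lambda y, d: d[k] == y)

def correlation_lhs(joint, k, l):
    """(2.11): indicator delta(i_k - i_l), i.e. experts k and l agree."""
    return expectation(joint, lambda y, d: d[k] == d[l])
```

Each constraint then amounts to requiring that one of these left-hand sides equals the value supplied by the experts or measured on data.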





Marco Saerens and François Fouss 



We will now see how to compute the joint probability distribution satisfying 
the set of constraints (2.2), (2.3), (2.6), (2.9). 

In the case of classifier combination, the values of $\Delta$ and $\sigma$ should be readily
available based on statistics recorded on a training set or on previous classification
tasks.
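
As an illustration of how such statistics might be recorded, the sketch below estimates the context-independent reliability of (2.7) and agreement of (2.10) as simple frequencies on a labelled set. The function name and the input layout (one list of decisions per classifier) are assumptions made for this example; the paper does not prescribe an estimator.

```python
def reliability_and_agreement(true_labels, decisions):
    """Estimate success rates and pairwise agreement rates by counting.

    true_labels: list of observed outcomes y, one per example.
    decisions:   decisions[k][t] is the label chosen by classifier k on example t.
    Returns (delta, sigma): delta[k] is the fraction of examples on which
    classifier k is correct; sigma[k][l] is the fraction on which k and l agree.
    """
    n = len(true_labels)
    delta = [sum(d == y for d, y in zip(dec, true_labels)) / n for dec in decisions]
    sigma = [[sum(a == b for a, b in zip(dec_k, dec_l)) / n for dec_l in decisions]
             for dec_k in decisions]
    return delta, sigma
```

With few examples these frequencies are noisy, which is one way the conflicting constraints discussed in Section 3.4 could arise in practice.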



3 The Maximum Entropy Approach 

3.1 A Score of Aggregation for Experts' Opinions

As already stated, we would like to estimate the joint probability density

$$P(y = i, d(1) = i_1, d(2) = i_2, \ldots, d(m) = i_m \mid \mathbf{x}) \tag{3.1}$$

satisfying the set of constraints (2.2), (2.3), (2.6), (2.9). The maximum entropy
estimate of (3.1) will be denoted by

$$\hat{P}(y = i, d(1) = i_1, d(2) = i_2, \ldots, d(m) = i_m \mid \mathbf{x}) \tag{3.2}$$

with a hat. From this joint density, (3.2), we will compute the a posteriori
probability of the true outcome $y = i$,

$$\hat{P}(y = i \mid \mathbf{x}) = \sum_{i_1, \ldots, i_m = 1}^{n} \hat{P}(y = i, d(1) = i_1, \ldots, d(m) = i_m \mid \mathbf{x}) \quad \text{for } i = 1 \ldots n \tag{3.3}$$

and this score will define our score of aggregation for the experts' opinions. It
represents the probability of outcome $y = i$ satisfying all the constraints provided
by the experts and based on the estimated density that has maximum entropy.
In equation (3.3), we average over all the possible situations that can appear in
the context $\mathbf{x}$, that is, over all the different decisions of the experts, where each
situation is weighted by its probability of appearance.
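
The marginalisation in (3.3) is a plain sum over the experts' decisions. A short Python sketch, assuming the joint density is stored as a dictionary keyed by `(y, d)` with `d` a tuple of decisions (a layout chosen for this example only):

```python
def aggregate_scores(joint, n_outcomes):
    """Score of aggregation (3.3): sum the joint density over all expert decisions."""
    scores = [0.0] * n_outcomes
    for (y, d), p in joint.items():
        scores[y] += p
    return scores
```

States that are absent from the dictionary are treated as having probability zero.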

Notice that, in a different framework, Myung et al. [19] proposed to compute
the a posteriori probability of $y$ conditional on the experts' probabilities of taking
a given decision. This is, however, not well-defined since the experts provide a
subjective probability density and not a clear decision: $\pi(d(k) = i_k \mid \mathbf{x})$ is not a
random variable; the authors are therefore conditioning on a probability density
and not an event.

If we define $\mathbf{d} = [d(1), d(2), \ldots, d(m)]^{T}$ and $\mathbf{i} = [i_1, i_2, \ldots, i_m]^{T}$, we can
rewrite (3.3) in a more compact way as

$$\hat{P}(y = i \mid \mathbf{x}) = \sum_{\mathbf{i}} \hat{P}(y = i, \mathbf{d} = \mathbf{i} \mid \mathbf{x}) \quad \text{for } i = 1 \ldots n \tag{3.4}$$

We will now see how to compute these scores thanks to the maximum entropy
principle.

3.2 The Maximum Entropy Estimate

Our aim is to estimate $P(y = i, \mathbf{d} = \mathbf{i} \mid \mathbf{x})$ by seeking the probability density that
has maximum entropy

$$I = - \sum_{i, \mathbf{i}} P(y = i, \mathbf{d} = \mathbf{i} \mid \mathbf{x}) \log \left[ P(y = i, \mathbf{d} = \mathbf{i} \mid \mathbf{x}) \right] \tag{3.5}$$

among all the densities satisfying the constraints (2.2), (2.3), (2.6), (2.9). We
therefore build the Lagrange function associated with (3.5) and these constraints,
and compute its maximum with respect to the $P(y = j, \mathbf{d} = \mathbf{j} \mid \mathbf{x})$. This problem
has been studied extensively in the literature. We show in [5] that
$\hat{P}(y = i, \mathbf{d} = \mathbf{i} \mid \mathbf{x})$ takes the form

$$\hat{P}(y = i, \mathbf{d} = \mathbf{i} \mid \mathbf{x}) = G_{sum} \prod_{k=1}^{m} G_{op}(k, i_k) \times \prod_{k=1}^{m} \left[ G_{rel}(k) \right]^{\delta(i - i_k)} \prod_{k=1}^{m} \prod_{l=1}^{m} \left[ G_{cor}(k, l) \right]^{\delta(i_k - i_l)} \tag{3.6}$$

where the parameters $G_{sum}$, $G_{op}$, $G_{rel}$, $G_{cor}$ can be estimated by an
iterative scaling procedure (see next section).
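
To make the origin of the product form (3.6) more tangible, here is a sketch of a Lagrange function of the kind referred to above, written for the context-independent constraints (2.5), (2.8) and (2.11). The multiplier names $\lambda_{sum}$, $\lambda_{op}$, $\lambda_{rel}$, $\lambda_{cor}$ and the shorthand $P_{i\mathbf{i}}$ are assumptions introduced for this illustration; the derivation in [5] may be organised differently.

```latex
% Shorthand (assumption): P_{i\mathbf{i}} = P(y = i, \mathbf{d} = \mathbf{i} \mid \mathbf{x}).
\begin{align*}
\mathcal{L} ={}& -\sum_{i,\mathbf{i}} P_{i\mathbf{i}} \log P_{i\mathbf{i}}
  + \lambda_{sum}\Big(\sum_{i,\mathbf{i}} P_{i\mathbf{i}} - 1\Big)
  + \sum_{k=1}^{m}\sum_{j_k=1}^{n} \lambda_{op}(k,j_k)
    \Big(\sum_{i,\mathbf{i}} \delta(i_k - j_k)\,P_{i\mathbf{i}} - \pi(d(k)=j_k \mid \mathbf{x})\Big) \\
 & + \sum_{k=1}^{m} \lambda_{rel}(k)
    \Big(\sum_{i,\mathbf{i}} \delta(i - i_k)\,P_{i\mathbf{i}} - \Delta(k)\Big)
  + \sum_{k,l=1}^{m} \lambda_{cor}(k,l)
    \Big(\sum_{i,\mathbf{i}} \delta(i_k - i_l)\,P_{i\mathbf{i}} - \sigma(k,l)\Big) \\[4pt]
\frac{\partial \mathcal{L}}{\partial P_{i\mathbf{i}}} ={}&
  -\log P_{i\mathbf{i}} - 1 + \lambda_{sum}
  + \sum_{k} \lambda_{op}(k,i_k)
  + \sum_{k} \lambda_{rel}(k)\,\delta(i - i_k)
  + \sum_{k,l} \lambda_{cor}(k,l)\,\delta(i_k - i_l) = 0 \\[4pt]
\Rightarrow\quad P_{i\mathbf{i}} ={}&
  e^{\lambda_{sum}-1} \prod_{k} e^{\lambda_{op}(k,i_k)}
  \prod_{k} \big[e^{\lambda_{rel}(k)}\big]^{\delta(i - i_k)}
  \prod_{k,l} \big[e^{\lambda_{cor}(k,l)}\big]^{\delta(i_k - i_l)}
\end{align*}
```

Under these assumptions, identifying $G_{sum} = e^{\lambda_{sum}-1}$, $G_{op}(k, i_k) = e^{\lambda_{op}(k, i_k)}$, $G_{rel}(k) = e^{\lambda_{rel}(k)}$ and $G_{cor}(k,l) = e^{\lambda_{cor}(k,l)}$ gives a product of the same shape as (3.6).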



3.3 Computing the Maximum Entropy Estimate 

The first step is to verify that there is a feasible solution to the problem at hand. 
Since all the constraints are linear, a linear programming procedure can be used 
to solve this problem. 

Once we have verified that there is indeed a feasible solution, an iterative
scaling procedure for estimating $G_{sum}$, $G_{op}$, $G_{rel}$, $G_{cor}$ can easily be
derived (see for instance [4]). The iterative scaling procedure aims to satisfy
each constraint in turn, and iterates over the set of constraints (as proposed by
many authors, for instance [15]). It has been shown that this iterative procedure
converges to the solution provided that there exists a feasible solution to the
problem (that is, the set of constraints can be satisfied). Indeed, the entropy
criterion to be maximized is concave and the constraints are linear, so that convex
programming algorithms can be used in order to solve the problem.
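
The idea of satisfying each constraint in turn can be sketched in a few lines of Python. The version below is a generic iterative proportional fitting loop for 0/1 indicator constraints, written for illustration; it is not the authors' implementation, and the function name, the `(feature, target)` representation and the stopping rule are assumptions. Each step rescales the states where the indicator is 1 so that their total mass equals the target, and rescales the remaining states so that the density still sums to one.

```python
from itertools import product

def maxent_by_iterative_scaling(n_outcomes, n_experts, constraints,
                                max_sweeps=20000, tol=1e-10):
    """constraints: list of (feature, target) with feature(y, d) -> bool and
    target the required probability of the event {feature is true}."""
    states = [(s[0], s[1:]) for s in product(range(n_outcomes), repeat=n_experts + 1)]
    masks = [([bool(feature(y, d)) for (y, d) in states], target)
             for feature, target in constraints]
    p = [1.0 / len(states)] * len(states)  # uniform start: maxent without constraints
    for _ in range(max_sweeps):
        worst = 0.0
        for mask, target in masks:
            mass = sum(p_s for p_s, on in zip(p, mask) if on)
            worst = max(worst, abs(mass - target))
            if 0.0 < mass < 1.0:  # skip degenerate cases instead of dividing by zero
                up, down = target / mass, (1.0 - target) / (1.0 - mass)
                p = [p_s * (up if on else down) for p_s, on in zip(p, mask)]
        if worst < tol:  # every constraint was already met when visited
            break
    return dict(zip(states, p))

# Setting of Section 4.1 with z = 0.7 (experts and outcomes indexed from 0).
opinions_for_0 = [0.3, 0.3, 0.8]   # pi(d(k) = 0 | x)
reliabilities = [0.7, 0.7, 0.7]    # Delta(k)
constraints = []
for k in range(3):
    constraints.append((lambda y, d, k=k: d[k] == 0, opinions_for_0[k]))
    constraints.append((lambda y, d, k=k: d[k] == y, reliabilities[k]))

joint = maxent_by_iterative_scaling(2, 3, constraints)
p_y0 = sum(p for (y, d), p in joint.items() if y == 0)  # aggregated score for y = 0
```

With two outcomes, the opinion constraint for outcome 1 follows from normalisation, so only the constraint for outcome 0 is listed. If the constraints are infeasible the loop simply stops after `max_sweeps` without meeting them, so a separate feasibility check, as described in the text, remains necessary.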

3.4 The Case Where There Is No Feasible Solution 

It can be the case that no solution satisfying the constraints (2.2), (2.3), (2.6),
(2.9) exists. This means that there is a conflict between the different estimates
$\pi$, $\Delta$, $\sigma$, a situation that cannot normally appear in reality. In that case,
the user of the system should revise his different pieces of knowledge or data.

However, if the user nevertheless wants to compute an aggregated score despite
this conflicting situation, we have to relax the set of constraints in some way.
There are two different ways of doing this: (1) by introducing slack variables
that measure the lack of constraint satisfaction, or (2) by relaxing the equality
constraints by providing intervals instead of exact values. These two approaches
are introduced in the next two paragraphs.



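Introduction of Slack Variables. In this case, some equality constraints are
relaxed. For instance, let us consider that we are willing to relax the reliability
and correlation constraints for all experts. By introducing slack variables,
$\varepsilon_{k}^{+}$, $\varepsilon_{k}^{-}$, $\varepsilon_{kl}^{+}$, $\varepsilon_{kl}^{-}$, measuring the lack of constraint satisfaction for each constraint,
we have

$$\sum_{i} P(y = i, d(k) = i \mid \mathbf{x}) + \varepsilon_{k}^{+} - \varepsilon_{k}^{-} = \Delta(k \mid \mathbf{x}) \quad \text{for } k = 1 \ldots m \tag{3.7}$$

$$\sum_{i} P(d(k) = i, d(l) = i \mid \mathbf{x}) + \varepsilon_{kl}^{+} - \varepsilon_{kl}^{-} = \sigma(k, l \mid \mathbf{x}) \quad \text{for } k, l = 1 \ldots m \tag{3.8}$$

$$0 \le \varepsilon_{k}^{+}, \varepsilon_{k}^{-}, \varepsilon_{kl}^{+}, \varepsilon_{kl}^{-} \le \theta \tag{3.9}$$

where $\theta$ is a threshold provided by the user: the slack variables are not allowed to
exceed this threshold. Consequently, we want to minimize the lack of constraint
satisfaction,

$$\min \left[ \sum_{k} \left( \varepsilon_{k}^{+} + \varepsilon_{k}^{-} \right) + \sum_{k,l} \left( \varepsilon_{kl}^{+} + \varepsilon_{kl}^{-} \right) \right]$$

subject to constraints (2.2), (2.3), (3.7), (3.8), (3.9). This is a standard linear
programming problem.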



Introduction of Intervals. Another alternative would be to relax the equality
constraints by providing intervals instead of exact values. Once again, let us
consider that we are willing to relax the reliability and correlation constraints for
all experts. The problem would be reformulated as a maximum entropy problem
with inequality constraints:

$$\Delta^{-}(k \mid \mathbf{x}) \le \sum_{i} P(y = i, d(k) = i \mid \mathbf{x}) \le \Delta^{+}(k \mid \mathbf{x}) \quad \text{for } k = 1 \ldots m \tag{3.10}$$

$$\sigma^{-}(k, l \mid \mathbf{x}) \le \sum_{i} P(d(k) = i, d(l) = i \mid \mathbf{x}) \le \sigma^{+}(k, l \mid \mathbf{x}) \quad \text{for } k, l = 1 \ldots m \tag{3.11}$$

Once more, numerical procedures related to iterative scaling can be used in 
order to compute the maximum entropy solution [4]. We have to maximize (3.5) 
subject to constraints (2.2), (2.3), (3.10), (3.11). 

4 Simulation Results 

For illustration purposes, we used the proposed combination rule in two different 
conditions. For each condition, we compute the following values: 

Maximum entropy. If there is a feasible solution, we compute the maximum 
entropy solution to the problem. 

Linear programming. If there is no feasible solution, we compute the linear
programming solution to the problem (see Section 3.4).

Weighted average. We also compute a weighted average solution, for
comparison.

The weighted average is computed as follows. With each expert, we associate
a weight, $w(k)$, which is a normalised measure of his reliability (the weights sum
to one), and we assume that each reliability, $\Delta(k \mid \mathbf{x})$, is greater than 0.5 (the expert
performs better than a random guess): $w(k) = (\Delta(k) - 0.5) / \sum_{l} (\Delta(l) - 0.5)$.
The weighted average score is therefore $Score(i) = \sum_{k} w(k)\, \pi(d(k) = i \mid \mathbf{x})$.
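
The weighted average baseline is simple enough to write out directly. The sketch below follows the two formulas above; the function name and the list-of-lists layout for the opinions are assumptions made for this example.

```python
def weighted_average_scores(opinions, reliabilities):
    """opinions[k][i] is pi(d(k) = i | x); reliabilities[k] is Delta(k) > 0.5.

    Returns Score(i) = sum_k w(k) * pi(d(k) = i | x), with weights
    w(k) proportional to Delta(k) - 0.5 and normalised to sum to one.
    """
    excess = [r - 0.5 for r in reliabilities]
    if any(e <= 0.0 for e in excess):
        raise ValueError("each reliability must be greater than 0.5")
    total = sum(excess)
    weights = [e / total for e in excess]
    n_outcomes = len(opinions[0])
    return [sum(w * op[i] for w, op in zip(weights, opinions))
            for i in range(n_outcomes)]
```

For example, `weighted_average_scores([[0.3, 0.7], [0.3, 0.7], [0.8, 0.2]], [0.7, 0.7, 0.9])` gives the third expert half of the total weight, since its excess reliability (0.4) equals the sum of the other two (0.2 each).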

Notice that we did not introduce correlation constraints in this set of simu- 
lations. For all simulations, we consider the case where there are three experts 
{1, 2, 3} and two outcomes {0, 1}. 

4.1 First Set of Simulations 

We set, for the experts' opinions, $\pi(d(1) = 0 \mid \mathbf{x}) = 0.3$, $\pi(d(1) = 1 \mid \mathbf{x}) = 0.7$,
$\pi(d(2) = 0 \mid \mathbf{x}) = 0.3$, $\pi(d(2) = 1 \mid \mathbf{x}) = 0.7$, $\pi(d(3) = 0 \mid \mathbf{x}) = 0.8$, $\pi(d(3) = 1 \mid \mathbf{x}) = 0.2$. In
other words, the first two experts agree, while the third one has an opposite
opinion. For the experts' reliability, we set $\Delta(1) = 0.7$, $\Delta(2) = 0.7$, $\Delta(3) = z$, where
$0.5 < z < 1$. The results are shown in Figure 1 (a), where we display $\hat{P}(y = 0 \mid \mathbf{x})$
and $Score(0)$ in terms of the reliability of expert 3, i.e. $z$.

We observe that when the reliability of expert 3 is high, the fusion rules (that
is, the experts combination rules) favour outcome 0. Notice that when $z > 0.8$,
there is no feasible solution, that is, the constraints cannot be satisfied. Notice
also that the maximum entropy solution is always more in favour of outcome 0
than the weighted average ($\hat{P}(y = 0 \mid \mathbf{x}) > Score(0)$).
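
One way to see where the threshold $z = 0.8$ comes from is the following short argument, which is our own illustration and not part of the original derivation. It only shows that $z \le 0.8$ is necessary for feasibility in this setting.

```latex
% For any joint density satisfying the opinion and reliability constraints,
% any pair of experts k, l and any outcome i:
\begin{align*}
\left| \pi(d(k) = i \mid \mathbf{x}) - \pi(d(l) = i \mid \mathbf{x}) \right|
  &\le P(d(k) \ne d(l) \mid \mathbf{x}) \\
  &\le P(d(k) \ne y \mid \mathbf{x}) + P(d(l) \ne y \mid \mathbf{x})
   = \left(1 - \Delta(k)\right) + \left(1 - \Delta(l)\right).
\end{align*}
% Applied to experts 1 and 3 and outcome 0 in the first set of simulations:
\[
|0.3 - 0.8| = 0.5 \le (1 - 0.7) + (1 - z) \quad\Longrightarrow\quad z \le 0.8 .
\]
```

The first inequality holds because two experts whose marginal opinions differ by some amount must disagree at least that often; the second holds because two experts can only disagree with each other when at least one of them disagrees with the outcome.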

[Figure 1: two panels; horizontal axes: (a) reliability of expert 3, (b) opinion of expert 3.]

Fig. 1. Two examples of simulation of the fusion rule (see Sections 4.1 and 4.2 for
details). We display the results of the three combination rules (maximum entropy,
weighted average and linear programming) in terms of (a) the reliability of expert 3
and (b) expert 3’s opinion.



4.2 Second Set of Simulations 

In this second example, we set, for the experts' opinions, $\pi(d(1) = 0 \mid \mathbf{x}) = 0.85$,
$\pi(d(1) = 1 \mid \mathbf{x}) = 0.15$, $\pi(d(2) = 0 \mid \mathbf{x}) = 0.8$, $\pi(d(2) = 1 \mid \mathbf{x}) = 0.2$, $\pi(d(3) = 0 \mid \mathbf{x}) = z$,
$\pi(d(3) = 1 \mid \mathbf{x}) = (1 - z)$ and, for the reliability, $\Delta(1) = 0.7$, $\Delta(2) = 0.75$,
$\Delta(3) = 0.8$. The results are shown in Figure 1 (b), where we display $\hat{P}(y = 0 \mid \mathbf{x})$
and $Score(0)$ in terms of the opinion of expert 3, i.e. $z$.

We observe that the weighted average rule is linear in the opinion of expert
3, while maxent is nonlinear. Below the value of $z = 0.35$, we see that the
constraints are incompatible; in this case, the procedure described in Section 3.4
(linear programming) is used in order to compute the aggregation score.
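
A quick screen for incompatible constraints can be sketched as follows. It uses a pairwise necessary condition that we introduce here for illustration (it is not taken from the paper): two experts whose opinions on some outcome differ by more than the sum of their error rates, $(1 - \Delta(k)) + (1 - \Delta(l))$, cannot both satisfy their opinion and reliability constraints. The function name and input layout are assumptions, and an empty result does not prove feasibility; the linear programming check of Section 3.3 is still the complete test.

```python
def pairwise_conflicts(opinions, reliabilities, tol=1e-12):
    """Return the pairs (k, l), indexed from 0, that violate the necessary condition
    max_i |pi_k(i) - pi_l(i)| <= (1 - Delta(k)) + (1 - Delta(l))."""
    conflicts = []
    m = len(opinions)
    for k in range(m):
        for l in range(k + 1, m):
            gap = max(abs(a - b) for a, b in zip(opinions[k], opinions[l]))
            budget = (1.0 - reliabilities[k]) + (1.0 - reliabilities[l])
            if gap > budget + tol:
                conflicts.append((k, l))
    return conflicts

def second_simulation(z):
    """Opinions and reliabilities of Section 4.2 for a given opinion z of expert 3."""
    opinions = [[0.85, 0.15], [0.8, 0.2], [z, 1.0 - z]]
    reliabilities = [0.7, 0.75, 0.8]
    return opinions, reliabilities
```

In the setting of Section 4.2, the condition for experts 1 and 3 reads $|0.85 - z| \le 0.3 + 0.2$, i.e. $z \ge 0.35$, which is consistent with the threshold reported above.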

5 Conclusion 

We introduced a new way of combining experts' opinions or classifiers' outputs.
It is based on the maximum entropy framework; maximum entropy seeks the
joint probability density of a set of random variables that has maximum entropy
while satisfying the constraints, i.e. it does not introduce any “additional ad hoc
information”. It is therefore the “most honest” characterization of our knowledge
given the available facts. The available knowledge is expressed through a set of 
linear constraints on the joint density including a measure of reliability of the 
experts, and of the correlation between them. This way, the different constraints 
(representing the a priori knowledge about the problem) are incorporated within 
a single measure. 

Iterative mathematical programming methods are used in order to compute
the maximum entropy solution, with guaranteed convergence to the global maximum
if, of course, there is a feasible solution. If there is a conflict between the available
facts, i.e. between the constraints, so that there is no feasible solution, we could still
compute the solution that is “closest”, in some sense, to satisfying the constraints.

However, even if this approach seems promising, it has not yet been evaluated
in the context of classifier combination. Further work will thus be devoted to
the experimental comparison with more standard techniques such as those that
were mentioned in the introduction. We therefore cannot currently provide any
conclusion about the applicability (and performance) of this approach.

Abstract. In this paper, a new method for handling missing feature
values in classification is presented. The idea is to form
an ensemble of one-class classifiers trained on each feature or on each
preselected group of features, or to compute a dissimilarity representation
from the features. Thus, when any feature values are missing for a data point to
be labeled, the ensemble can still make a reasonable decision based on
the remaining classifiers. In contrast to standard algorithms
that handle the missing features problem, it is possible to build
an ensemble that can classify test objects with all possible occurrences
of missing features without retraining a classifier for each combination of
missing features. Additionally, to train such an ensemble, the training set
does not need to be uncorrupted. The performance of the proposed
ensemble is compared with standard methods used for the missing feature
values problem on several UCI datasets.



1 Introduction 

The increasing resolution of sensors also increases the probability that one
feature or a group of features is missing or strongly contaminated by noise. Data
may contain missing features for a number of reasons, e.g. the data collection
procedure may be imperfect, or a sensor gathering information may be distorted
by unmeasurable effects, yielding a loss of data. Several ways of dealing with
missing feature values have been proposed. The simplest and most
straightforward is to ignore all missing features and use the remaining observations to
design a classifier [1]. Another group of methods estimates the values of the missing
features from the available data, e.g. by replacing missing feature values by their
means estimated on a training set [2], [3]. Morin [4] proposed to replace
missing feature values by the values of these features from their nearest neighbors, in
the available, lower dimensional space, from the training set. [5, 4, 1] described
different solutions using linear regression to estimate substitutes for missing
feature values. However, Little [5] showed that many such methods are
inconsistent, i.e. discriminant functions designed from the completed training set do
not converge to the optimal limit discriminant functions as the sample size tends to
infinity. At this moment, the methods recommended as generally best are based on
the EM algorithm [6, 1].

However, because of the complexity of the existing methods, none of them can
provide a solution for all cases with missing features that can occur during
classification. When a corrupted test point occurs, a classifier is retrained or
the missing features are replaced by estimated ones, which can lead to worse results
than when the classification decision is based just on the existing features [5, 1].
Additionally, in most of the proposed solutions to the missing feature problem
it is assumed that the training data are uncorrupted; thus, potentially valuable data
are neglected during training.

In this paper, several techniques based on combining one-class classifiers [7]
are introduced to handle missing features. The classifiers are trained on
one-dimensional problems or n-dimensional problems, or features are combined in
dissimilarity representations. The presented method can cope with all possible
situations of missing data, from a single missing feature to $N - 1$, without
retraining a classifier. It also makes use of corrupted data available for training.

The layout of this paper is as follows: in section 2, the problem of missing
feature values and combining one-class classifiers (occs) is addressed. In section
2.1, some possibilities for combining one-class classifiers to handle the missing
features problem are discussed. Section 3 shows results on UCI datasets and discusses
the relative merits and disadvantages of combining occs. Section 4 presents the
discussion and conclusions.

2 Formal Framework 

Suppose two sets of data are given: a training set

$$\mathcal{L} = \{ (x_m, y_m) : x_m \in \mathbb{R}^{p_m} \}$$

and a test set

$$\mathcal{T} = \{ x_t : x_t \in \mathbb{R}^{q_t} \}, \quad \text{where } (p_m, q_t) \in \mathbb{R}^{N},$$

where the $x$'s represent objects and the $y$'s represent labels (it is assumed that
both the training set $\mathcal{L}$ and the test set $\mathcal{T}$ are corrupted). $N$ is the number of
all the features considered in a classification task. Each object $x$ in $\mathcal{L}$ or $\mathcal{T}$ can
reside in a different space. Even if $x_{m_1}$ and $x_{m_2}$ are represented in spaces of
the same dimensionality ($\|p_{m_1}\| = \|p_{m_2}\|$), the present features might be different,
$p_{m_1} \neq p_{m_2}$. Such a problem is called classification with missing data in the training
and test sets.

Suppose a classifier is designed by using uncorrupted data. Assume that the input
(test) data are then corrupted in particular known ways. How should such
corrupted inputs be classified to obtain the minimal error? For example, consider a
classifier for data with two features, such that one of the features is missing for a
particular object $x$ to be classified. Fig. 1 illustrates a three-class problem, where for the
test object $x$ the feature $f_1$ is missing. The measured value of $f_2$ for $x$ is $x_{f_2}$.
Clearly, if we assume that the missing value can be substituted by mean($f_1$),
$x$ will be classified as $y_2$. However, if the priors are equal, $y_3$ would be a better
decision, because $p(x_{f_2} \mid y_3)$, estimated on the training set, is the largest of the
three likelihoods. In terms of a set of existing features $F$, the posteriors are [8]:

$$P(y_i \mid F) = \frac{\int g_i(F)\, p(F)\, df_{-}}{\int p(F)\, df_{-}} \tag{1}$$

where $f_{-}$ indicates the missing features and $g_i(F) = P(y_i \mid F, f_{-})$ is the
conditional probability from a classifier. In short, equation (1) presents the
marginalization of the posterior probability over the missing features.

Several attempts were made to estimate missing feature values for a test
object [1], e.g. by:

- solving a classification problem in an available lower-dimensional feature
space $F$ (obviously, in this case no estimation of the missing data is required);

- replacing missing feature values in $\mathcal{T}$ by the means of known values from $\mathcal{L}$;

- replacing missing values in $\mathcal{T}$ by values from the nearest neighbor from $\mathcal{L}$;

- using the expectation-maximization algorithm to maximize e.g. the class
posteriors. This method is the most complicated one, and an assumption about the
underlying data distribution has to be made.

In further experiments the first and the third methods mentioned above are
used as a comparison to the proposed methods.



Fig. 1. Class conditional distributions for a three-class problem. If a test point misses
the feature value of $f_1$, the optimal classification decision will be $y_3$ because
$p(x_{f_2} \mid y_3)$ (estimated on the training set) is the largest.



2.1 Combining One-Class Classifiers 

In the problem of one-class classification the goal is to accurately describe one
class of objects, called the target class, as opposed to a wide range of other objects
which are not of interest, called outliers. Many standard pattern recognition
methods are not well equipped to handle this type of problem; they require
complete descriptions for both classes. Especially when one class is very diverse
and ill-sampled, (two-class) classifiers usually yield a very bad generalization for
this class. Various methods have been developed to make such a data description
[7]. In most cases, the probability density of the target set is modeled. This
requires a large number of samples to overcome the curse of dimensionality [9].

Since during the training stage it is assumed that only target objects may be
present, a threshold is set on the tails of the estimated probability or distance $d$
such that a specified fraction of the target data is rejected, e.g. 0.1. Then, in the
test stage, the estimated distances $d$ can be transformed to resemble posterior
probabilities, $p(y \mid x)$ for the target class and $1 - p(y \mid x)$ for the
outlier class.

Fig. 2. A multi-class problem solved by: (left) combining two-class classifiers
(one-vs-all approach), (right) combining one-class classifiers by the maximum
resemblance to a model.



Fig. 2 illustrates the differences between solving a multi-class problem by
combining two-class classifiers (one-vs-all approach) [9] and by combining one-class
classifiers [7]. In the first approach, the entire data space is divided into parts,
each assigned to a particular class. A new object $x$ has to be classified to one of
the classes present in the training set. This means that, in the case of outliers, the
classification is erroneous. In addition, in the one-vs-all or pairwise combining
approach one has to compensate for the imbalance problem, e.g. by setting
probabilities to appropriate levels.

The right plot in Fig. 2 shows occs combined by the max rule. This means
that, in order to handle a multi-class problem, occs can be combined by the
max rule or by a trained combiner. In this approach, one assigns a new data
point to a particular class only if it lies in one of the described domains. If a
new object $x$ lies outside the region described by the target class, it is assigned
to the outlier class. In the combination of two-class classifiers it appears that
often the more robust mean combination rule is to be preferred. Here extreme
posterior probability estimates are averaged out. In one-class classification only
the target class is modeled, $P(x \mid \omega_{T_c})$, and a low uniform distribution is assumed
for the outlier class. This makes this classification problem asymmetric, and extreme
target class estimates are not canceled by extreme outlier estimates. However,
the mean combination covers a broad domain in feature space [10], while the
product rule has a restricted range. Especially in high dimensional spaces this
extra area will cover a large volume and potentially a large number of outliers.



2.2 Proposed Method 

In this paper we propose several methods based on combining one-class classifiers
to handle the missing features problem. Our goal is to build an ensemble
that does not require retraining of a classifier for every combination of missing
data and at the same time minimizes the number of classifiers that have to be
considered. In this section we will describe several ways of combining occs and some
possibilities for the base classifiers.

First, two-class classifiers, combined as in the one-vs-all method (Fig. 2, left)
and trained on all possible combinations of missing feature values, are considered.
In such a case the number of base two-class classifiers that have to be trained,
$K_{C,N}$, is $(2^N - 1)$ times a factor that depends on the number of classes, where
$N$ is the number of features and $C$ is the number
of classes. Since all the features cannot be missing, 1 is subtracted from all $2^N$
possibilities. For a problem with ten features and two classes, $K_{2,10} = 1023$,
and for 20 features, $K_{2,20} = 1048575$. For such simple problems the number of
classifiers is already quite large.

On the other hand, if one-class classifiers are trained on all possible
combinations of missing features, then the number of possibilities reduces to
$K_{C,N} = (2^N - 1) \cdot C$ and the classification regions are no longer considered as open
spaces (Fig. 2, right). However, for a large number of features this is a quite
complicated study, since the number of classifiers is still cumbersome to handle
and the system is difficult to validate.

In this paper, one of the proposed methods is to use one-class classifiers,
trained on one-dimensional problems, as base classifiers and to combine them by
fixed combining rules: mean, product, max, etc. This reduces the number of
classifiers that have to be kept in the pool to $N \cdot C$ for the fixed
combining rules; for example, $K_{2,20} = 40$.
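The two pool sizes quoted in the text can be compared with a short sketch. It only evaluates the expressions $(2^N - 1) \cdot C$ and $N \cdot C$; the function names are illustrative and not part of the original method.

```python
def pool_size_all_subsets(n_features: int, n_classes: int) -> int:
    """One one-class classifier per class and per non-empty feature subset."""
    return (2 ** n_features - 1) * n_classes


def pool_size_single_features(n_features: int, n_classes: int) -> int:
    """One one-class classifier per class and per single feature."""
    return n_features * n_classes
```

For 20 features and two classes, `pool_size_single_features(20, 2)` gives the 40 classifiers mentioned above, whereas training on every pattern of available features needs about two million.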

Below we describe how the fixed (mean, product, and max) combining rules are
applied to the missing feature values problem in multi-class problems.



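Mean combining rule:
$$y(x|\omega_T) = \arg\max_c \frac{1}{N'} \sum_{i=1}^{N'} P(x_i|\omega_{T_c})$$

Product combining rule:
$$y(x|\omega_T) = \arg\max_c \prod_{i=1}^{N'} P(x_i|\omega_{T_c})$$

Max combining rule:
$$y(x|\omega_T) = \arg\max_c \max_i P(x_i|\omega_{T_c})$$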

where $P(x_i|\omega_{T_c})$ is the probability, estimated from feature $x_i$, that
object x belongs to the target class c, and N' is the number of available
features. The probabilities estimated on single features are combined by fixed
rules. The test object x is assigned to the class to which it has the maximum
resemblance. However, because only a single feature $x_i$ is considered at a time
during classification, feature interactions are neglected. This can lower the
performance of the proposed ensemble. This problem is addressed in Section 3.2
of this paper.
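The three fixed rules can be sketched as follows. This is a minimal illustration, assuming the per-feature target probabilities are already available as a (classes x features) array and that missing features are flagged by a boolean mask; the function name and array layout are assumptions, not part of the paper.

```python
import numpy as np


def combine_fixed_rule(probs, available, rule="mean"):
    """Combine per-feature target probabilities over the available features.

    probs     : array of shape (C, N), probs[c, i] ~ P(x_i | target class c)
    available : boolean array of shape (N,), True where feature i is observed
    rule      : "mean", "product" or "max"
    Returns the index of the winning class and the per-class scores.
    """
    p = np.asarray(probs, dtype=float)[:, np.asarray(available, dtype=bool)]
    if p.shape[1] == 0:
        raise ValueError("at least one feature must be available")
    if rule == "mean":
        scores = p.mean(axis=1)      # average over the N' available features
    elif rule == "product":
        scores = p.prod(axis=1)
    elif rule == "max":
        scores = p.max(axis=1)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmax(scores)), scores
```

Because only the columns of available features enter the score, no classifier has to be retrained when a different set of features is missing.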



Combining Dissimilarity Based Representations. The second method proposed in
this paper to handle missing features is to combine the non-missing features in
a dissimilarity measure [11]. In this case, instead of training a classifier in
the N'-dimensional feature space, it is trained in a dissimilarity space. In our
experiments the sigmoid transformation from distances to dissimilarities is
used:



$$dd_{jk} = \frac{1}{N'} \sum_{i=1}^{N'} \left[ \frac{2}{1 + \exp(-d_{jki}/\sigma_i)} - 1 \right] \qquad (2)$$

where $dd_{jk}$ is the computed dissimilarity between objects j and k, and
$d_{jki}$ is the Euclidean distance between those two objects along feature i.
To increase robustness in the case of missing features, dissimilarities are
averaged over one-dimensional representations. $\sigma_i$ is approximated by the
average distance between training objects when only the single feature i is
considered.
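The averaged sigmoid dissimilarity can be sketched as below. The sketch assumes the transformation has the form 2 / (1 + exp(-d / sigma_i)) - 1 per feature, that missing values are encoded as NaN, and that the scale sigma_i of each feature is supplied by the caller; all of these are illustrative assumptions.

```python
import numpy as np


def sigmoid_dissimilarity(xj, xk, sigma):
    """Average per-feature sigmoid dissimilarity over jointly observed features.

    xj, xk : 1-D arrays of feature values, NaN marks a missing feature
    sigma  : 1-D array of per-feature scales (e.g. average training distance)
    """
    xj = np.asarray(xj, dtype=float)
    xk = np.asarray(xk, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    avail = ~(np.isnan(xj) | np.isnan(xk))   # features present in both objects
    if not avail.any():
        raise ValueError("no feature is observed in both objects")
    d = np.abs(xj[avail] - xk[avail])        # one-dimensional Euclidean distance
    s = 2.0 / (1.0 + np.exp(-d / sigma[avail])) - 1.0
    return float(s.mean())                   # average over the N' available features
```

Each term lies in [0, 1), so the average stays on the same scale regardless of how many features are missing.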

3 Experiments 

In the experiments a simple Parzen classifier was used as the base classifier,
with the threshold set to reject 0.05 of the target objects in the training set
[7]. The smoothing parameter was optimized by a leave-one-out approach [12]. The
Linear Programming Dissimilarity-data Description (LPDD) [13] (dism) was used as
the base classifier for combining dissimilarity representations, with a
rejection rate of 0.05 of the target objects. To combine classifiers, the
resemblances to the target class $y(x|\omega_T)$ were transformed to posterior
probabilities by a sigmoid transformation. The fixed (mean, product and max)
[14] combining rules are applied to the posterior probabilities computed from
the resemblances on single features.
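A one-class Parzen classifier with a 5% training rejection rate can be sketched as follows. This is only an illustration under stated assumptions: a Gaussian window with a fixed width `h` (the paper optimizes the smoothing parameter by leave-one-out, which is omitted here), and a threshold taken as the 0.05 quantile of the training densities. The function names are hypothetical.

```python
import numpy as np


def parzen_density(train, x, h):
    """Gaussian Parzen density of the rows of x, estimated from train.

    train : array of shape (n, d); x : array of shape (m, d); h : window width
    """
    train = np.asarray(train, dtype=float)
    x = np.atleast_2d(np.asarray(x, dtype=float))
    d = train.shape[1]
    sq = np.sum((x[:, None, :] - train[None, :, :]) ** 2, axis=2)
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return np.exp(-sq / (2.0 * h ** 2)).mean(axis=1) / norm


def fit_threshold(train, h, reject_fraction=0.05):
    """Density threshold that rejects the given fraction of training targets."""
    dens = parzen_density(train, train, h)
    return float(np.quantile(dens, reject_fraction))


def accepts(train, x, h, threshold):
    """True for the rows of x that are accepted as target objects."""
    return parzen_density(train, x, h) >= threshold
```

With 100 training objects and distinct density values, this threshold rejects exactly five of them, matching the 0.05 rejection rate.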

The proposed methods were compared to two standard methods designed to handle
missing data: training a classifier in the lower-dimensional space of available
features (lower), and replacing the missing features of a test object by the
features of its nearest neighbor from the training set (fnn). The experiments
were carried out on several UCI datasets [15]: WBCD - Wisconsin Breast Cancer
Dataset (number of classes c = 2, number of features k = 9), MFEAT (c=10,
k=649), CBANDS (c=24, k=30), DERMATOLOGY (c=6, k=34), SATELLITE (c=6, k=36). To
avoid the curse of dimensionality, the total number of features in the MFEAT
dataset was reduced from the original 649 to 100 (MFEAT 100) and 10 (MFEAT 10)
by forward feature selection based on maximization of the Fisher criterion: the
trace of the ratio of the between- and within-scatter matrices,
$J = \mathrm{tr}(S_W^{-1} S_B)$ [16]. The datasets were split randomly into
equally sized training and test sets. For each percentage of missing features
(0% to 90% in steps of 10%), ten training sets and, for each training set, ten
test sets were randomly generated (10 x 10).
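The Fisher criterion used for feature selection can be computed as in the sketch below, assuming the form J = tr(S_W^{-1} S_B) with unnormalized within- and between-class scatter matrices. The pseudo-inverse is used here only to keep the sketch robust to a singular S_W; that choice is an assumption, not something stated in the paper.

```python
import numpy as np


def fisher_criterion(X, y):
    """Return tr(pinv(S_W) @ S_B) for data X (n, d) with class labels y (n,)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    s_within = np.zeros((d, d))
    s_between = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        s_within += centered.T @ centered            # scatter inside the class
        diff = (mean_c - mean_all)[:, None]
        s_between += len(Xc) * (diff @ diff.T)       # scatter of the class means
    return float(np.trace(np.linalg.pinv(s_within) @ s_between))
```

A forward selection would evaluate this score for each candidate feature added to the current subset and keep the candidate with the largest value.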

3.1 Combining occs Trained on Single Features 

In this section the ensembles built from classifiers trained on individual
features are evaluated. It is assumed that each feature contributes a similar,
independent amount of information to the classification problem. Any
interactions between features are neglected.

In Fig. 3 the mean errors of different solutions to the missing features problem
are presented for several multi-class problems. The classifiers are trained on
one-dimensional problems and combined by fixed combining rules. In the dism
method the corresponding dissimilarities are computed and LPDD is trained on all
one-class classification problems. The results are compared with two standard
methods for the missing features problem: lower, in which a classifier is
trained on all available features, neglecting the missing ones, and fnn, in
which missing feature values are replaced by the features of the nearest
neighbor of the test object in the training set. It can be observed that the
mean and product rules perform best over the entire range of missing features.
Which of these fixed combining rules is better depends on the dataset. The
dissimilarity representation does not perform well, apart from WBCD, for which
the performance is comparable with the fixed combiners when a small percentage
of features is missing. The reason is that the dissimilarities computed on the
training set, using all the features, do not resemble the dissimilarities
computed on the test set with missing features. However, the dism method
outperforms the standard fnn method. The reason for the poor performance of the
fnn method is that, as more features are missing, replacing them by features
from the training set reduces the differences between test objects. The single
classifier trained on all available features performs best on the CBANDS
dataset, but is outperformed on the other problems by the fixed combining
rules. It can be concluded that a complicated problem split into simpler ones
that are then combined can outperform a single, large classifier [17, 18].

Fig. 3. Mean error for different percentages of missing features for the
combiners trained on single features for various combining rules: (mean,
product, max), dism - dissimilarity representation LPDD, lower - the Parzen
classifier trained on all available features, fnn - the Parzen classifier
trained on available features plus features from the nearest neighbor from a
training set. The results are averaged over 10 x 10 times; see text for details.

Fig. 4. Mean error for different percentages of missing features for the
combiners trained on (n+1) features for various combining rules: (mean,
product, max), dism - dissimilarity representation LPDD, lower - the Parzen
classifier trained on all available features, fnn - the Parzen classifier
trained on available features plus features from the nearest neighbor from a
training set. The results are averaged over 10 x 10 times; see text for details.



3.2 Combining occs Trained on (n + 1) Features 

In the previous section it was assumed that every feature contributes a similar,
independent amount of information to the classification problem. In this section
we study the case in which a fixed number of features is always present, or in
which there is a certain subset of features without which the classification is
almost random, e.g. medical data consisting of name, age, height, ...,
examinations. It is probably possible to classify a patient into the
healthy/unhealthy group without a name or age provided, but not without specific
examination measurements. One possible solution is to use a weighted combining
rule [19].

In this paper, a different approach is proposed. Let us assume that the same
n features are always present for the test objects. Therefore, instead of N
possible missing features we have N - n possibilities. In this case, we propose
to train (N - n - 1) base one-class classifiers in an (n+1)-dimensional space.
As a result, the base classifiers are highly dependent. According to common
knowledge on combining classifiers [14, 20], combining is beneficial when base
classifiers differ. However, in our case, there is a trade-off between the
number n of common features and how well the posterior probabilities are
estimated. In Fig. 4, the mean error for n = 3 for WBCD and MFEAT 10 is shown.
The standard deviation varies between 1-2% from the mean value. The classifiers
are trained on (n+1) features and then combined. Compared to the results shown
in Fig. 3, the performance of the fixed combiners increases. The posterior
probabilities are better estimated and some feature dependencies are also
included in the estimation. If additional knowledge about a classification
problem is available, e.g. that n features are always present, better
classification performance can be achieved by appropriate combining.

3.3 Small Sample Size Problem

In this section the performance of the proposed method is evaluated for small
sample size problems [9]. In small sample size problems the number of objects
per class is similar to or smaller than the number of features.



[Fig. 5 panels: MFEAT 100 and SATELLITE; horizontal axis: % of missing features]



Fig. 5. Small sample size problems. Mean error for different percent of missing features 
for the combiners trained on single features for various combining rules: (mean, product, 
max), dism - dissimilarity representation LPDD. lower - the Parzen classifier trained 
on all available features, fnn - the Parzen classifier trained on available features plus 
features from nearest neighbor from a training set. The results are averaged over 10 x 10 
times; see text for details. 

Fig. 5 shows the mean error for two small sample size problems. Because the
probabilities are estimated on single features, the proposed method is robust to
small sample size problems. The classifier statistics are better estimated and
the constructed ensemble is robust against noise.

4 Conclusions 

In this paper, several methods for handling missing feature values have been
proposed. The presented methods are based on combining one-class classifiers
trained on one-dimensional or (n+1)-dimensional problems. Additionally, a
dissimilarity based method is proposed to handle the missing features problem.
Compared to the standard methods, our methods are much more flexible, since
they require far fewer classifiers and do not require retraining the system for
each new pattern of missing feature values. Additionally, our method is robust
to small sample size problems because the classification problem is split into
N smaller ones.

Abstract. In this paper we describe new methods to build a kernel matrix
from a collection of kernels for classification purposes using Support
Vector Machines (SVMs). The methods build the combination by quantifying,
relative to the classification labels, the difference of information
among the kernels. The proposed techniques have been successfully
evaluated on a variety of artificial and real data sets.



1 Introduction 

Support Vector Machines (SVMs) have proven to be a successful tool for the
solution of a wide range of classification problems since their introduction in
[3]. The method uses as its primary source of information a kernel matrix
K(i, j), where K is a Mercer kernel and i, j represent data points in the
sample. By the representer theorem (see for instance [16]), SVM classifiers
always take the form $f(x) = \sum_i \alpha_i K(x, i)$. The approximation and
generalization capacity of the SVM is determined by the choice of the kernel K
[4]. A common way to obtain SVM kernels is to consider a linear differential
operator D, and choose K as the Green's function of the operator D*D, where D*
is the adjoint operator of D [15]. It is easy to show that
$\|f\|_K^2 = \|Df\|_{L_2}^2$ [9]. Thus we are imposing smoothing conditions on
the solution f. However, it is hard to know in advance which particular
smoothing conditions to impose for a given data set. Fortunately, kernels are
straightforwardly related to similarity (or equivalently distance) measures,
and this information is actually available in many data analysis problems. In
addition, working with kernels avoids the need to work explicitly with
Euclidean coordinates. This is particularly useful for data sets involving
strings, trees, microarrays or text, for instance.

Nevertheless, using a single kernel may not be enough to solve the problem
under consideration accurately. This happens, for instance, in text mining
problems, where analysis results may vary depending on the document similarity
measure chosen [8]. Thus, the information provided by a single similarity
measure (kernel) may not be enough for classification purposes, and the
combination of kernels appears as an interesting alternative to the choice of
the 'best' kernel.

The specific literature on the combination of kernels is still in its early
stages. A natural approach is to consider linear combinations of kernels. This
is the approach followed in [10], and is based on the solution of a
semi-definite programming problem to calculate the coefficients of the linear
combination. The solution of this kind of optimization problem is
computationally very expensive [19] and, therefore, out of the scope of this
paper. In addition, there is no reason why a linear combination should provide
optimal results. Another approach is proposed in [2]. The method, called MARK,
builds a classifier (not the specific kernel matrix) by a boosting type
algorithm. Classification results shown in that work are very similar to those
obtained with an SVM classifier.

In this paper we describe three methods to build a kernel matrix from a 
collection of kernels for classification purposes. We provide a general scheme for 
combining the available kernels. Our methods build the combination by quan- 
tifying, relative to the classification labels, the difference of information among 
the kernels. 

The paper is organized as follows. Section 2 describes the proposed methods 
for combining kernels. The experimental setup and results on artificial and real 
data sets are described in Section 3. Section 4 concludes. 

2 Methods 

Let $K_1, K_2, \ldots, K_m$ be a set of m input kernels defined on a data set,
and denote by K* the desired output combination. We first concentrate on binary
classification problems with two kernels (m = 2) and then extend the method to
the case m > 2. Let y denote the label vector, where $y_i \in \{-1, +1\}$.

To motivate the discussion, consider the average of the kernels involved:

$$K^* = \frac{1}{2}(K_1 + K_2). \qquad (1)$$

There is a straightforward similarity between (1) and the first term of the
standard decomposition of a matrix into its symmetric and skew-symmetric parts:
$K^* = \frac{1}{2}(K^* + K^{*T}) + \frac{1}{2}(K^* - K^{*T})$. If K* is a
symmetric matrix, then $K^* = \frac{1}{2}(K^* + K^{*T})$ and the skew-symmetric
part vanishes. By analogy we will derive kernel combinations of the form

$$K^* = \frac{1}{2}(K_1 + K_2) + f(K_1 - K_2), \qquad (2)$$

such that if $K_1$ and $K_2$ tend to produce the same classification results,
then $f(K_1 - K_2)$ becomes negligible and (1) yields
$K^* \simeq K_1 \simeq K_2$. The term $f(K_1 - K_2)$ quantifies the difference
of information between kernels $K_1$ and $K_2$ for the purpose of
classification.

2.1 The Absolute Value (AV) Method 

Consider the matrix Y = diag(y), whose diagonal entries are the labels $y_i$.
Our first proposal builds K* through the formula:

$$K^* = \frac{1}{2}(K_1 + K_2) + \tau\, Y |K_1 - K_2| Y, \qquad (3)$$

where $\tau$ is a positive constant that controls the weight given to the term
$f(K_1 - K_2)$. Each element K*(i, j) takes the form

$$K^*(i,j) = \frac{1}{2}(K_1(i,j) + K_2(i,j)) + \tau\, y_i y_j |K_1(i,j) - K_2(i,j)|. \qquad (4)$$

The intuition underlying the method is justified next. Consider two data points
i and j in the same class ($y_i y_j = 1$). Taking into account that
$\max(x, y) = \frac{1}{2}(x + y) + \frac{1}{2}|x - y|$, it follows directly
that, for $\tau = 1/2$, $K^*(i,j) = \max(K_1(i,j), K_2(i,j))$. Analogously,
from $\min(x, y) = \frac{1}{2}(x + y) - \frac{1}{2}|x - y|$, if i and j belong
to different classes ($y_i y_j = -1$), then
$K^*(i,j) = \min(K_1(i,j), K_2(i,j))$. Given that K* can be interpreted as a
similarity measure, if i and j are in the same class, the method guarantees
that K*(i, j) will be as large as possible. On the other hand, if i and j
belong to different classes, a low similarity between them can be expected.
Hence, the method tends to move points belonging to the same class closer
together, and tends to separate points belonging to different classes. In the
case of an asymmetric classification problem, this method reduces to the
pick-out method presented in [13].

Notice that positive definiteness of K* is not guaranteed. Several solutions
have been proposed to address this problem [14]. A first possibility is to
replace K* by $K^* + \lambda I$, for $\lambda > 0$ large enough to make all the
eigenvalues of the kernel matrix positive. Another direct approach uses
Multidimensional Scaling to represent the data set in a Euclidean space [7].
Finally, it is also possible to define a new kernel matrix as $K^{*T} K^*$
[17]. In practice, there seems to be no universally best method to solve this
problem.
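The AV combination and the diagonal-shift repair can be sketched with NumPy as follows. The sketch assumes symmetric input matrices and labels in {-1, +1}; the function names and the small tolerance `eps` are illustrative choices, not part of the paper.

```python
import numpy as np


def av_combine(K1, K2, y, tau=0.5):
    """Absolute Value combination: 1/2 (K1 + K2) + tau * Y |K1 - K2| Y."""
    K1 = np.asarray(K1, dtype=float)
    K2 = np.asarray(K2, dtype=float)
    y = np.asarray(y, dtype=float)
    yy = np.outer(y, y)                  # entry (i, j) equals y_i * y_j
    return 0.5 * (K1 + K2) + tau * yy * np.abs(K1 - K2)


def shift_to_psd(K, eps=1e-10):
    """Add lambda * I with the smallest lambda that lifts all eigenvalues to eps."""
    K = np.asarray(K, dtype=float)
    lam_min = np.linalg.eigvalsh(K).min()    # K is assumed symmetric
    shift = max(0.0, eps - lam_min)
    return K + shift * np.eye(K.shape[0])
```

With `tau=0.5` the result equals the element-wise maximum of the two kernels for pairs in the same class and the element-wise minimum for pairs in different classes, which is the behaviour derived above.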



2.2 The Squared Quantity (SQ) Method 

Our second proposal builds each element of K* as:

$$K^*(i,j) = \frac{1}{2}(K_1(i,j) + K_2(i,j)) + \tau\, y_i y_j (K_1(i,j) - K_2(i,j))^2, \qquad (5)$$

where $\tau$ plays the same role as in the AV method. Figure 1 shows different
choices of $f(K_1 - K_2)$ for different values of $\tau$. The straight lines
correspond to the AV method. Notice that this is the upper limiting case of the
SQ method. The lower limiting case is represented by the x-axis line,
corresponding to the use of $\frac{1}{2}(K_1 + K_2)$. In addition, the SQ curves
are differentiable everywhere.

As in the previous case, positive semidefiniteness is not assured and the same
comments apply.



2.3 The Squared Matrix (SM) Method 

This method provides a straightforward solution to the lack of positive
semidefiniteness. While the SQ method works by squaring each element of the
$K_1 - K_2$ matrix, the SM method works by considering the continuous operator
defined as the square of the whole matrix, that is, $(K_1 - K_2)(K_1 - K_2)$.
The combination formula now becomes:

$$K^* = \frac{1}{2}(K_1 + K_2) + \tau\, Y (K_1 - K_2)^2 Y, \qquad (6)$$

where $\tau$ plays the same role as in the two previous methods. Given that the
matrices $Y(K_1 - K_2)(K_1 - K_2)Y$ and $(K_1 - K_2)^T (K_1 - K_2)$ have the
same eigenvalues, the matrix built by the SM method is positive semidefinite.
Hence, this method provides a matrix K* arising from a Mercer kernel.

Fig. 1. Different choices of $f(K_1 - K_2)$ for different values of $\tau$. The
straight lines correspond to the AV method. The curves correspond to the SQ
method.
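The difference between squaring element-wise (SQ) and squaring the whole matrix (SM) can be made concrete with a short sketch. It assumes symmetric kernel matrices, labels in {-1, +1}, and the SM term written as Y (K1 - K2)(K1 - K2) Y, as described in the text; the function names are illustrative.

```python
import numpy as np


def sq_combine(K1, K2, y, tau=1.0):
    """Squared Quantity: the difference is squared element by element."""
    K1 = np.asarray(K1, dtype=float)
    K2 = np.asarray(K2, dtype=float)
    yy = np.outer(y, y).astype(float)
    return 0.5 * (K1 + K2) + tau * yy * (K1 - K2) ** 2


def sm_term(K1, K2, y):
    """Squared Matrix correction term Y (K1 - K2)(K1 - K2) Y."""
    Y = np.diag(np.asarray(y, dtype=float))
    D = np.asarray(K1, dtype=float) - np.asarray(K2, dtype=float)
    return Y @ D @ D @ Y


def sm_combine(K1, K2, y, tau=1.0):
    """Squared Matrix: 1/2 (K1 + K2) + tau * Y (K1 - K2)^2 Y."""
    K1 = np.asarray(K1, dtype=float)
    K2 = np.asarray(K2, dtype=float)
    return 0.5 * (K1 + K2) + tau * sm_term(K1, K2, y)
```

Since Y is its own inverse for labels of plus or minus one, the SM term is a similarity transform of the square of a symmetric matrix, so its eigenvalues are non-negative; adding it to an average of positive semidefinite kernels keeps the result positive semidefinite.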

2.4 Combining More than Two Kernels 

Proceeding in a recursive way, the extension of the methods to the combination
of more than two kernels is straightforward. For the sake of simplicity,
consider the case m = 3. Let $K_1$, $K_2$ and $K_3$ be the considered kernel
matrices, and $K^*_{12}$ the kernel obtained by using one of the preceding
methods with $K_1$ and $K_2$ as input matrices. The final solution K* is
obtained via the combination of $K^*_{12}$ and $K_3$.

The order in which kernels are given to the recursive procedure is a current
research issue. For instance, the order induced by the classification capacity
of each kernel could be considered. In this paper, the classification capacity
has been measured in terms of the classification success rate of each kernel.
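The recursive extension can be sketched as a left fold over the list of kernels. The pairwise rule used below is the AV formula from Section 2.1; any of the three pairwise methods could be passed instead. The function names are illustrative.

```python
from functools import reduce

import numpy as np


def av_combine(K1, K2, y, tau=0.5):
    """Pairwise AV rule: 1/2 (K1 + K2) + tau * Y |K1 - K2| Y."""
    yy = np.outer(y, y).astype(float)
    return 0.5 * (K1 + K2) + tau * yy * np.abs(K1 - K2)


def combine_many(kernels, y, combine_two=av_combine, **kwargs):
    """Combine K1 and K2, then the result with K3, and so on, in the given order."""
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    if len(kernels) < 2:
        raise ValueError("need at least two kernels")
    return reduce(lambda acc, K: combine_two(acc, K, y, **kwargs), kernels)
```

Even with `tau=0` the result depends on the order: three kernels given as (K1, K2, K3) are weighted 1/4, 1/4 and 1/2, which illustrates why the ordering matters.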

3 Experiments 

Next we test the performance of the previous methods on both artificial and
real data sets. To check the performance of the methods described above, once
the kernel matrix K* has been constructed, it is used to train an SVM. For the
AV, SQ and SM methods the value of the parameter $\tau$ has been assigned via
cross-validation.

Fig. 2. Artificial data. Two groups with different scattering matrices.

Given an unlabelled data point x, $K(x, x_i)$ has to be evaluated. We can
calculate two different values for $K(x, x_i)$, the first one assuming x belongs
to class +1 and the second assuming x belongs to class -1. For each assumption,
all we have to do is compute the distance between x and the SVM hyperplane and
assign x to the class corresponding to the largest distance from the hyperplane.

In the following, for all the data sets, we will use 70% of the data for training 
and 30% for testing. 

We have compared the proposed methods with the following classifiers:
Multivariate adaptive regression splines (MARS) [5], Logistic Regression (LR),
Linear Discriminant Analysis (LDA), k-nearest neighbour classification (KNN),
the MARK combining method [2] and SVMs using an RBF kernel
$K_c(x_i, x_j) = e^{-\|x_i - x_j\|^2 / c}$ with c = 0.5d, where d is the data
dimension (see [18]).
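The RBF kernel used for the single-kernel SVM baseline can be computed as in the sketch below, assuming the form exp(-||x_i - x_j||^2 / c) with the default c = 0.5 d. The function name is illustrative.

```python
import numpy as np


def rbf_kernel_matrix(X, c=None):
    """Kernel matrix with entries exp(-||x_i - x_j||^2 / c); c defaults to 0.5 * d."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if c is None:
        c = 0.5 * d
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dist = np.maximum(sq_dist, 0.0)   # guard against tiny negative round-off
    return np.exp(-sq_dist / c)
```

The resulting matrix could be passed to any SVM implementation that accepts a precomputed kernel.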



3.1 Artificial Data Sets 

Two Groups with Different Scattering Matrices. This data set, shown in
Figure 2, is made up of 1200 points in $\mathbb{R}^2$, divided into two groups.
Each group corresponds to a normal cloud, and the two groups have different
covariance matrices. Overlap between the two groups is apparent. We have defined
two RBF kernels $K_1$ and $K_2$, where for each kernel the constant c has been
chosen as the average of the squared Euclidean distances within each group.
Results on this experiment are shown in Table 1.

The newly proposed methods based on kernel combinations achieve similar
results. The decision function for the AV method involves significantly fewer
support vectors than the other methods. Notice that kernels $K_1$ and $K_2$ are
not especially well suited to the problem at hand. Nevertheless, their
performance is quite good.

Table 1. Classification errors for the two groups with different scattering
matrices.

Method          Train error   Test error   Support vectors
1/2(K1 + K2)    1.1 %         1.2 %        12.9 %
AV              1.1 %         1.9 %        6.0 %
SQ              1.1 %         1.3 %        12.9 %
SM              1.1 %         1.3 %        28.6 %
K-NN            1.5 %         1.4 %
LDA             2.5 %         1.9 %
LR              2.6 %         2.6 %
MARK            0.7 %         7.7 %        39.9 %
MARS            2.6 %         2.6 %
SVM             1.1 %         1.3 %        28.2 %

To end with this data set, we check the performance of the four combination
methods when three kernels are used as input. We combine the two kernels
previously described and an RBF kernel with c = 0.5d. Results are shown in
Table 2. Since the previous results are very close to the true theoretical
error, there is not much room for improvement and similar results are obtained.

Table 2. Classification errors for the two groups with different scattering
matrices using three kernels.

Method          Train error   Test error   Support vectors
1/2(K1 + K2)    1.2 %         1.3 %        17.4 %
AV              1.7 %         1.8 %        7.5 %
SQ              1.1 %         1.3 %        20.8 %
SM              1.0 %         1.3 %        55.3 %

Two Kernels with Complementary Information. This data set consists of 400
two-dimensional points (200 per class). Each group corresponds to a normal
cloud with mean $\mu_i$ and diagonal covariance matrix $\sigma_i^2 I$. Here
$\mu_1 = (3, 3)$, $\mu_2 = (5, 5)$, $\sigma_1 = 0.7$ and $\sigma_2 = 0.9$. We
have defined two kernels from the projections of the data set onto the
coordinate axes. The point of this example is that, separately, both kernels
achieve a poor result (a test error of 15%). Table 3 shows the successful
results obtained with the combination methods. In particular, the SM method
attains the best overall results.

Table 3. Classification errors for the kernels with complementary information.

Method          Train error   Test error   Support vectors
1/2(K1 + K2)    2.3 %         2.7 %        9.2 %
AV              2.5 %         4.0 %        5.8 %
SQ              0.0 %         2.8 %        35.6 %
SM              5.4 %         2.2 %        20.3 %
K-NN            3.4 %         3.8 %
LDA             7.5 %         7.8 %
LR              7.6 %         7.7 %
MARK            7.7 %         8.0 %        2.0 %
MARS            3.4 %         3.9 %
SVM             3.4 %         3.6 %        26.3 %

3.2 Real Data Sets

Cancer Data Set. In this section we have dealt with a database from the UCI
Machine Learning Repository: the Breast Cancer data set [11]. The data set
consists of 683 observations with 9 features each. For this data set we have
defined two input kernels similar to those in the first example, that is, two
RBF kernels $K_1$ and $K_2$, where for each kernel the constant c has been
chosen as the average of the squared Euclidean distances within each group.
Results are shown in Table 4.

Table 4. Classification errors for the cancer data.

Method          Train error   Test error   Support vectors
1/2(K1 + K2)    0.0 %         3.4 %        11.4 %
AV              2.9 %         3.4 %        6.1 %
SQ              2.1 %         2.9 %        9.6 %
SM              0.0 %         3.9 %        11.2 %
K-NN            3.9 %         4.8 %
LDA             3.8 %         4.4 %
LR              13.1 %        13.1 %
MARK            0.0 %         4.4 %        18.3 %
MARS            2.7 %         3.2 %
SVM             0.0 %         4.4 %        49.5 %

The SQ method shows the best overall performance. Once again the new
combination methods provide better results than the SVM with a single kernel,
using significantly fewer support vectors.



A Handwritten Digit Recognition Problem. The experiment in this section is a
binary classification problem: the recognition of digits '3' and '9' from the
Alpaydin and Kaynak database [1]. The data set is made up of 1134 records,
represented by 32 x 32 binary images. We have employed two different methods to
specify features in order to describe the images. The first one is the 4 x 4
method: features are defined as the number of ones in each of the 64 squares of
dimension 4 x 4. The second method was introduced by Frey and Slate [6]: 16
attributes are derived from the image, related to the horizontal/vertical
position, width, height, etc. This is a typical example with two sources of
information that are possibly different and perhaps complementary. We use these
representations to calculate two kernels from the Euclidean distance. The
classification performance for all the methods is tabulated in Table 5. In this
case the AV and SM methods achieve the best overall performance. Our methods
based on kernel combinations improve on the results obtained using the rest of
the techniques.

Table 5. Classification errors for the handwritten digit data set.

Method              Train error   Test error   Support vectors
1/2(K1 + K2)        0.0 %         0.8 %        4.9 %
AV                  0.8 %         0.3 %        13.9 %
SQ                  0.0 %         0.8 %        32.6 %
SM                  0.0 %         0.6 %        7.4 %
K-NN (4 x 4)        0.0 %         1.7 %
K-NN (Frey-Slate)   0.0 %         21.5 %
LDA (4 x 4)         0.6 %         1.1 %
LDA (Frey-Slate)    3.2 %         2.2 %
LR (4 x 4)          0.0 %         1.7 %
LR (Frey-Slate)     2.6 %         2.5 %
MARK                0.0 %         1.4 %        13.0 %
MARS (4 x 4)        0.8 %         1.1 %
MARS (Frey-Slate)   1.9 %         3.3 %
SVM (4 x 4)         0.0 %         1.1 %        4.0 %
SVM (Frey-Slate)    6.4 %         6.6 %        12.2 %

3.3 A Text Data Base 

To check the methods in a high dimensional setting, we work on a small text
data base with two groups of documents. The first class is made up of 296
records from the LISA data base, with the common topic 'library science'. The
second class contains 394 records on 'pattern recognition' from the INSPEC data
base. There is a mild overlap between the two classes, due to records dealing
with 'automatic abstracting'. We select terms that occur in at least 10
documents (obtaining 982 terms). Labels are assigned to terms by voting on the
classes of the documents in which these terms appear. The task is to correctly
predict the class of each term. Following [12], we have defined the kernel
$K_1$ by $K_1(i,j) = |x_i \wedge x_j| / |x_i|$, where $|x_i| = \sum_k |x_{ik}|$
measures the number of documents indexed by term i, and $|x_i \wedge x_j|$ the
number of documents indexed by both terms i and j. Similarly,
$K_2(i,j) = |x_i \wedge x_j| / |x_j|$. The task is to classify the database
terms using the information provided by both kernels. Note that we are dealing
with about 1000 points in 600 dimensions, and this is a nearly empty space.
This means that it will be very easy to find a hyperplane that divides the two
classes. Nevertheless, the example is still useful to gauge the relative
performance of the proposed methods. Following the scheme of the preceding
examples, Table 6 shows the results.

Our proposed methods for the combination of kernels clearly outperform the rest
of the methods. In particular, the AV method achieves the best performance.
Notice that this is the most difficult example in this section and the one
where the advantage of the kernel combination methods is most evident.

Table 6. Classification errors for the term data base. "-" indicates non
convergence of the method.

Method          Train error   Test error   Support vectors
1/2(K1 + K2)    0.0 %         1.8 %        19.6 %
AV              0.0 %         1.1 %        8.3 %
SQ              0.0 %         3.0 %        33.6 %
SM              0.0 %         1.8 %        16.8 %
K-NN            12.8 %        14.0 %
LDA             0.0 %         31.4 %
LR              -             -
MARK            -             -            -
MARS            -             -
SVM             23.8 %        23.9 %       63.2 %

4 Conclusions 

In this work we have proposed three new techniques for the combination of
kernels within the context of SVM classifiers. This is an interesting approach
to combining classifiers, since domain knowledge is often available in the form
of kernel information. The suggested methods compare favorably with other well
established classification techniques and with other combining techniques on a
variety of artificial and real data sets. Among the new kernel combining
techniques proposed in this paper, no single method is best overall. Further
research will focus on the theoretical properties of the methods and on
extensions.

Acknowledgments 

This research has been partially supported by grants TIC2003-05982-C05-05 
from MCyT and PPR-2003-42 from Universidad Rey Juan Carlos, Spain. 



References 

1. E. Alpaydin and C. Kaynak. Cascading Classifiers. Kybernetika 34 (4) 369-374, 
1998. 

2. K. Bennett, M. Momma, and J. Embrechts. MARK: A Boosting Algorithm for 
Heterogeneous Kernel Models. Proceedings of SIGKDD International Conference 
on Knowledge Discovery and Data Mining, 2002. 

3. C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20:273-297, 
1995. 

4. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. 
Cambridge University Press, 2000. 

5. J. Friedman. Multivariate adaptive regression splines (with discussion). Annals
of Statistics, vol. 19, no. 1, 1-141, 1991.

6. P.W. Frey and D.J. Slate. Letter Recognition Using Holland-Style Adaptive
Classifiers. Machine Learning, 6 (2) 161-182, 1991.





Combining Kernel Information for Support Vector Classification 






7. L. Goldfarb. A unified approach to pattern recognition. Pattern Recognition, 17 
(1984) 575-582. 

8. T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer,
2002.

9. M.I. Jordan. Advanced Topics in Learning & Decision Making. Course material
available at www.cs.berkeley.edu/~jordan/courses/281B-spring01.

10. G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui and M.I. Jordan.
Learning the kernel matrix with semi-definite programming. Proc. 19th Int Conf
Machine Learning, pp. 323-330, 2002.

11. O.L. Mangasarian and W.H. Wolberg. Cancer diagnosis via linear programming.
SIAM News, Volume 23, Number 5, 1990, 1-18.

12. A. Muñoz. Compound key word generation from document databases using a
hierarchical clustering ART model. Intelligent Data Analysis, vol. 1, pp. 25-48,
1997.

13. A. Muñoz, I. Martin de Diego and J.M. Moguerza. Support Vector Machine
Classifiers for Asymmetric Proximities. Proc. ICANN (2003), LNCS, Springer,
217-224.

14. E. Pekalska, P. Paclík and R.P.W. Duin. A Generalized Kernel Approach to
Dissimilarity-based Classification. JMLR, Special Issue on Kernel Methods 2 (2)
(2002) 175-211.

15. T. Poggio and F. Girosi. Networks for Approximation and Learning. Proceedings
of the IEEE, 78(10):1481-1497, 1990.

16. B. Schölkopf, R. Herbrich, A. Smola and R. Williamson. A Generalized Representer
Theorem. NeuroCOLT2 TR Series, NC2-TR2000-81, 2000.

17. B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K. Müller, G. Rätsch and A. Smola.
Input Space versus Feature Space in Kernel-based Methods. IEEE Transactions on
Neural Networks 10 (5) (1999) 1000-1017.

18. B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola and R.C. Williamson.
Estimating the Support of a High Dimensional Distribution. Neural Computation,
13(7):1443-1471, 2001.

19. L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49- 
95, 1996. 




Combining Classifiers Using Dependency-Based 
Product Approximation with Bayes Error Rate* 



Hee-Joong Kang 

Division of Computer Engineering, Hansung University 
389 Samsun-dong 3-ga, Sungbuk-gu, Seoul, Korea 
hjkang@hansung.ac.kr



Abstract. Combining classifiers using Bayesian formalism deals with a 
high dimensional probability distribution composed of a class and the de- 
cisions of classifiers. Thus product approximation is needed for the prob- 
ability distribution. Bayes error rate is upper bounded by the conditional 
entropy of the class and decisions, so the upper bound should be mini- 
mized for raising the class discrimination. By considering the dependency 
between class and decisions, dependency-based product approximation 
is proposed in this paper together with its related combination method. 
The proposed method is evaluated with the recognition of unconstrained 
handwritten numerals. 



1 Introduction 

Combining classifiers using Bayesian formalism deals with a high dimensional 
probability distribution composed of a class and the decisions of classifiers. So, it 
is usually hard to compute the high dimensional probability distribution without 
any approximation in real applications. Thus product approximation is needed 
for the probability distribution. On the assumption that the decisions are
conditionally independent given the class, the high dimensional probability
distribution is approximated with a product of two-dimensional component
distributions and the decisions are combined with such a product by Xu et al. [1]. This
assumption can be regarded as the special case of the first-order dependency 
among components. The first-order dependency-based product approximation 
(DBPA) was proposed by Chow and Liu using the measure of closeness criterion 
in [2]. The measure of closeness was devised to measure how close an approx- 
imating distribution was to a true distribution for the product approximation. 
Afterwards, Kang et al. proposed the second-order DBPA scheme in [3] and the 
third-order DBPA scheme in [4] by considering more than the first-order depen- 
dency among components using the same measure of closeness. These DBPAs 
did not have any constraint in dealing with the components to approximate the 
high dimensional probability distribution. 

On the other hand, Bayes error rate is upper bounded by the conditional 
entropy of class and variables, and so the upper bound should be minimized for 

* This research was financially supported by Hansung University in the year of 2004. 



F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 112-121, 2004. 
raising the class discrimination. By considering the dependency between class
and variables, another first-order DBPA was proposed by Wang and Wong [5]. 
That is, Wang and Wong defined the class-patterns (CP) mutual information for 
the product approximation, as the patterns were regarded as the variables. That 
is the difference between the product approximation by Wang and Wong and 
the product approximation by Chow and Liu in dealing with the components in 
the probability distribution. Kang and Lee applied the concept of CP mutual 
information to class-decisions (C-D) relationship in combining multiple classi- 
fiers and tried to combine multiple classifiers with the product approximation 
derived from C-D mutual information using Bayesian formalism [6]. The C-D 
mutual information provides theoretical ground so that the high dimensional 
probability distribution is optimally approximated with a product of low dimen- 
sional component distributions according to the order of dependency. Without 
any approximation, direct full dependency between class and decisions was con- 
sidered in the method of Behavior-Knowledge Space (BKS) in [7]. However, 
the BKS method has both the possibility of high rejection rates due to un- 
seen decisions and the exponential complexity in directly storing and estimating 
the high dimensional probability distribution. In this paper, new first-order
and second-order DBPA schemes based on the results in [6] are proposed with
Bayes error rate using the C-D mutual information, as the extended work of the 
first-order DBPA by Wang and Wong. The proposed methods are evaluated by 
combining multiple classifiers on the recognition of unconstrained handwritten 
numerals. 

As for recognition experiments, the unconstrained handwritten numerals are 
from Concordia University [8] and the University of California, Irvine (UCI) 
[9]. In total, six classifiers are combined at the abstract level; these classifiers
were developed by using the features or methodologies in [10, 11]. The Bayesian
combination methods based on those mentioned DBPAs are introduced in the 
recognition experiments as well as the BKS method. 

This paper is organized as follows. Section 2 explains the product approxima- 
tion schemes using the C-D mutual information with Bayes error rate. Bayesian 
combination using the proposed product approximation schemes is defined in 
Section 3. Experimental results for evaluating the proposed methods with other 
Bayesian combination methods are provided in Section 4 and the concluding 
remarks are given in Section 5. 

2 Product Approximation Scheme 
Using C-D Mutual Information 

A dependency-based approach with Bayes error rate plays an intermediate role 
between the independence assumption and the BKS method in several aspects. 
Considering the dth-order dependency makes the storage needs $O((K-d) \cdot L^{d+2})$,
where K is the number of classifiers and L is the number of classes, and it also
lowers the potentially high rejection rates owing to the product approximation.
In other words, the space complexity of the proposed approximation scheme, i.e.
$O((K-d) \cdot L^{d+2})$, ranges from $O(L^{3})$ for the first-order dependency
to $O(L^{K+1})$ for the BKS method, because $1 \le d \le K-1$. The proposed
approximation scheme also supports several probabilistic combination methods
from the dependence tree method to the BKS method, according to the order
of dependency d under permissible resources.

Hellman and Raviv proved an inequality between the Bayes error rate $P_e$ and
the conditional entropy $H(M|C)$ of class $M$ and variables $C$, given as
Eq. (1) below [12]. This paper regards the variables as decisions. The Bayes
error rate $P_e$ is upper bounded by the conditional entropy $H(M|C)$. Thus, the
C-D mutual information $U(M;C)$ is defined from the conditional entropy in
Eq. (1) and measures the degree of dependence between the class $M$ and the
decisions $C$, as in the following expressions:

$$P_e \le \frac{1}{2} H(M|C) = \frac{1}{2}\bigl(H(M) - U(M;C)\bigr) \qquad (1)$$

$$U(M;C) = \sum_{m} \sum_{c} P(c,m) \log \frac{P(c,m)}{P(m)\,P(c)} \qquad (2)$$

where $H(M)$ is the entropy of the class. Minimizing the upper bound of $P_e$ for
class discrimination leads to maximizing the C-D mutual information $U(M;C)$, since
the entropy $H(M)$ is constant with regard to $C$. Thus an optimal product set is
obtained by maximizing the C-D mutual information.
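As a small numerical illustration of Eqs. (1) and (2), the following sketch estimates $H(M)$, $U(M;C)$ and the resulting upper bound on $P_e$ from labelled decision vectors. It rests on assumptions that are not part of the paper: probabilities are plain relative frequencies, logarithms are taken to base 2, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(classes):
    """H(M) in bits, estimated from relative frequencies."""
    n = len(classes)
    return -sum(k / n * math.log2(k / n) for k in Counter(classes).values())


def cd_mutual_information(decisions, classes):
    """U(M;C) of Eq. (2); each element of `decisions` is one decision vector c."""
    n = len(classes)
    joint = Counter(zip(decisions, classes))
    p_c, p_m = Counter(decisions), Counter(classes)
    # k/n = P(c,m); k*n / (count(c)*count(m)) = P(c,m) / (P(c) P(m))
    return sum(k / n * math.log2(k * n / (p_c[c] * p_m[m]))
               for (c, m), k in joint.items())


def bayes_error_bound(decisions, classes):
    """Right-hand side of Eq. (1): (H(M) - U(M;C)) / 2."""
    return 0.5 * (entropy(classes) - cd_mutual_information(decisions, classes))
```

When the decision vectors determine the class, $U(M;C)$ equals $H(M)$ and the bound vanishes; when they carry no information about the class, the bound reduces to $H(M)/2$.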

2.1 First-Order DBPA 

When K decisions, $C_1, \ldots, C_K$, are combined, a first-order DBPA is obtained
by considering the first-order dependency among the decision components in the
probability distribution. The approximating distribution of C is defined in terms
of two-dimensional distributions as follows:

$$P_a(C_1, \ldots, C_K) = \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i(j)}}), \quad (0 \le i(j) < j) \qquad (3)$$

and the approximating distribution of C and M is defined in terms of
three-dimensional distributions as follows:

$$P_a(C_1, \ldots, C_K, M) = \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i(j)}}, M), \quad (0 \le i(j) < j) \qquad (4)$$

$$P_a(C_1, \ldots, C_K | M) = \frac{1}{P(M)} \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i(j)}}, M), \quad (0 \le i(j) < j) \qquad (5)$$

where $C_{n_j}$ is conditioned on both $C_{n_{i(j)}}$ and $M$, $(n_1, \ldots, n_K)$ is an
unknown permutation of the integers $(1, \ldots, K)$, and $C_0$ is a null component.
$P(C_{n_j} | C_0, M)$ is equal to $P(C_{n_j}, M)$, by definition. The first-order dependency
makes the C-D mutual information expand into the following expressions by
using Eqs. (3)-(5) and dropping the subscript n of C:
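
$$U(M;C) = \sum_{m} \sum_{c} P(c,m) \log \frac{P(c,m)}{P(m)\,P(c)}$$

$$= \sum_{m} \sum_{c} P(c,m) \log P(c|m) - \sum_{m} \sum_{c} P(c,m) \log P(c)$$

$$= -\sum_{m} P(m) \log P(m) + \sum_{j=1}^{K} \sum_{m,c} P(c,m) \log P(C_j | C_{i(j)}, m) - \sum_{j=1}^{K} \sum_{c} P(c) \log P(C_j | C_{i(j)})$$

$$= H(M) + \sum_{j=1}^{K} \bigl[ I(C_j; C_{i(j)}, M) - I(C_j; C_{i(j)}) \bigr] \qquad (6)$$

$$H(M) = -\sum_{m} P(m) \log P(m) \qquad (7)$$

$$\Delta I(C_j; C_{i(j)}) = I(C_j; C_{i(j)}, M) - I(C_j; C_{i(j)}). \qquad (8)$$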

Thus, it is obvious that the total sum of the Δ first-order C-D mutual information
$\Delta I(C_j; C_{i(j)})$ should be maximized. With the dependence tree method
of Wang and Wong, we can determine both the unknown permutation and its
conditioned permutation from the chosen optimal dependence tree. The time
complexity to find the dependence tree is O(n log n).
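The weight in Eq. (8) can be estimated directly from aligned lists of decisions and class labels. The sketch below is only an illustration under assumptions of our own: relative-frequency estimates, base-2 logarithms, and hypothetical function names. It does not reproduce the dependence tree search of Wang and Wong; it only computes the weight that such a search would rank.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X;Y) in bits from two aligned sequences of observations."""
    n = len(xs)
    joint, p_x, p_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(k / n * math.log2(k * n / (p_x[x] * p_y[y]))
               for (x, y), k in joint.items())


def delta_first_order(c_j, c_i, m):
    """Delta I(C_j; C_i) = I(C_j; C_i, M) - I(C_j; C_i), as in Eq. (8)."""
    # The pair (C_i, M) is treated as one composite variable.
    return mutual_information(c_j, list(zip(c_i, m))) - mutual_information(c_j, c_i)
```

A large value means that the class still explains much of $C_j$ once $C_i$ is known; the value drops to zero when $C_i$ already carries everything the class says about $C_j$.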

2.2 Second-Order DBPA 

A second-order DBPA is obtained by considering the second-order dependency
among the decision components in the probability distribution. The approximating
distribution of C is defined in terms of three-dimensional distributions
as follows:

$$P_a(C_1, \ldots, C_K) = \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i2(j)}}, C_{n_{i1(j)}}), \quad (0 \le i2(j) \le i1(j) < j) \qquad (9)$$

and the approximating distribution of C and M is defined in terms of
four-dimensional distributions as follows:

$$P_a(C_1, \ldots, C_K, M) = \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i2(j)}}, C_{n_{i1(j)}}, M), \quad (0 \le i2(j) \le i1(j) < j) \qquad (10)$$

$$P_a(C_1, \ldots, C_K | M) = \frac{1}{P(M)} \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i2(j)}}, C_{n_{i1(j)}}, M), \quad (0 \le i2(j) \le i1(j) < j) \qquad (11)$$

where $C_{n_j}$ is conditioned on $C_{n_{i2(j)}}$, $C_{n_{i1(j)}}$, and $M$, $(n_1, \ldots, n_K)$ is an
unknown permutation of the integers $(1, \ldots, K)$, and $C_0$ is a null component. Thus
$P(C_{n_j} | C_0, C_0, M)$ is equal to $P(C_{n_j}, M)$, and $P(C_{n_j} | C_0, C_{n_{i1(j)}}, M)$ is equal
to $P(C_{n_j} | C_{n_{i1(j)}}, M)$, by definition. The second-order dependency makes the
C-D mutual information expand into the following expressions by using
Eqs. (9)-(11) and dropping the subscript n of C:

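$$U(M;C) = \sum_{m} \sum_{c} P(c,m) \log \frac{P(c,m)}{P(m)\,P(c)}$$

$$= \sum_{m,c} P(c,m) \log \Bigl[ \frac{1}{P(m)} \prod_{j=1}^{K} P(C_j | C_{i2(j)}, C_{i1(j)}, m) \Bigr] - \sum_{c} P(c) \log \prod_{j=1}^{K} P(C_j | C_{i2(j)}, C_{i1(j)})$$

$$= -\sum_{m} P(m) \log P(m) + \sum_{j=1}^{K} \sum_{m,c} P(c,m) \log P(C_j | C_{i2(j)}, C_{i1(j)}, m) - \sum_{j=1}^{K} \sum_{c} P(c) \log P(C_j | C_{i2(j)}, C_{i1(j)})$$

$$= H(M) + \sum_{j=1}^{K} \bigl[ I(C_j; C_{i2(j)}, C_{i1(j)}, M) - I(C_j; C_{i2(j)}, C_{i1(j)}) \bigr] \qquad (12)$$

$$H(M) = -\sum_{m} P(m) \log P(m) \qquad (13)$$

$$\Delta I(C_j; C_{i2(j)}, C_{i1(j)}) = I(C_j; C_{i2(j)}, C_{i1(j)}, M) - I(C_j; C_{i2(j)}, C_{i1(j)}). \qquad (14)$$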

From the above derived Eq. (12), maximizing $U(M;C)$ leads to maximizing
$\sum_{j=1}^{K} \Delta I(C_j; C_{i2(j)}, C_{i1(j)})$, which is the total sum of the Δ second-order C-D mutual
information, since the remaining entropy term $H(M)$ is also constant with regard to
C. Then, the next step is to find an optimal product set by the second-order
dependency among all the permissible product sets. Finding the optimal product
set by the second-order dependency amounts to selecting the maximum sum of the Δ
second-order C-D mutual information covering the Δ first-order C-D mutual information,
as described in the following algorithm. From the found optimal product set, we
can determine the unknown permutation $(n_1, \ldots, n_K)$ and its two unknown
conditioned permutations $(n_{i2(1)}, \ldots, n_{i2(K)})$ and $(n_{i1(1)}, \ldots, n_{i1(K)})$. The time

complexity of the algorithm is O(n^). 

Input:

A set of (K + 1)-dimensional samples of C and M.

Output:

An optimal product set by the second-order dependency as per the Δ second-order
C-D mutual information.

Method:

1. Estimate two-, three-, and four-dimensional marginal distributions from the
samples.

2. Compute the weights ΔI(C_j; C_i(j)) and ΔI(C_j; C_i2(j), C_i1(j)) for all pairs
and triplets of classifiers from the estimated marginal distributions.

3. Compute the maximum weight sum consisting of Δ first-order and Δ second-order
C-D mutual information and find its associated optimal product set,
as in the following statements:

    maxTweight = 0;

    for n = 1 to no. of Δ first-order C-D mutual information do
        Tweight = weight of the n-th ΔI(C_j; C_i(j));
        while ((no. of untraversed classifiers) > 0) do
            choose one of untraversed classifiers and mark it traversed;
            choose the largest permissible Δ second-order C-D mutual information
            associated with the chosen classifier and one traversed classifier among
            all traversed classifiers;

            Tweight += weight of the chosen ΔI(C_j; C_i2(j), C_i1(j));

        end

        maxTweight = MAX(maxTweight, Tweight);

        store maxTweight and its associated Δ first-order and Δ second-order
        C-D mutual information;

    end

    obtain maximum maxTweight and its associated Δ first-order and Δ
    second-order C-D mutual information;
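Step 3 of the algorithm can be sketched in Python as follows. This is one possible reading rather than the author's implementation: the pseudocode leaves the visiting order of untraversed classifiers open, so the sketch visits them by index, and it conditions each newly visited classifier on the best pair of already traversed classifiers. The weight dictionaries `w1` and `w2` and the function name are hypothetical; they are assumed to hold the precomputed Δ first-order and Δ second-order weights of step 2.

```python
from itertools import combinations


def search_product_set(num_classifiers, w1, w2):
    """Greedy search over every starting pair.

    w1[(j, i)]    : weight Delta I(C_j; C_i), classifier j conditioned on i
    w2[(j, a, b)] : weight Delta I(C_j; C_a, C_b), stored with a < b
    Returns (best total weight, list of (classifier, conditioning tuple)).
    """
    best_weight, best_terms = float("-inf"), None
    for (j, i), start_weight in w1.items():
        traversed = [i, j]
        terms = [(j, (i,))]
        total = start_weight
        for c in range(num_classifiers):   # assumed visiting order: by index
            if c in traversed:
                continue
            # largest second-order weight over pairs of traversed classifiers
            pair = max(combinations(sorted(traversed), 2),
                       key=lambda ab: w2[(c, ab[0], ab[1])])
            total += w2[(c, pair[0], pair[1])]
            terms.append((c, pair))
            traversed.append(c)
        if total > best_weight:
            best_weight, best_terms = total, terms
    return best_weight, best_terms
```

With three classifiers, each starting pair leaves exactly one classifier to attach, so the search simply compares the sums of one first-order and one second-order weight.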

By using the systematic approach for product approximation, the order of
dependency considered can be easily extended to the dth-order under permissible
computing resources. Considering the dth-order dependency changes the approximating
distributions in Eqs. (9)-(11) according to the order of dependency d. An
optimal product set by the dth-order dependency consists of one component
distribution by Δ first-order C-D mutual information, one by Δ second-order
C-D mutual information, ..., one by Δ (d − 1)st-order C-D mutual information,
and multiple (i.e. (K − d)) component distributions by Δ dth-order C-D mutual
information.



3 Bayesian Combination Methods 

After the optimal dependence tree by the first-order dependency is found and 
all unknown permutations are determined, a Bayesian decision rule for the com- 
bination is derived from the Bayesian formalism and the optimal product set 
by the first-order dependency. For a hypothesized class m, a supported belief 
function $Bel(m)$ is defined by the following expressions using Eq. (4):

$$Bel(m) = P(m \in M | C_1, \ldots, C_K) \qquad (15)$$

$$= \frac{P(C_1, \ldots, C_K, m)}{P(C_1, \ldots, C_K)} \approx \frac{\prod_{j=1}^{K} P(C_{n_j} | C_{n_{i(j)}}, m)}{P(C_1, \ldots, C_K)}$$

$$\approx \eta \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i(j)}}, m) \qquad (16)$$

with $\eta$ as a constant that ensures that $\sum_{i=1}^{L} Bel(m_i) = 1$, where $(n_1, \ldots, n_K)$
is the permutation of the integers $(1, \ldots, K)$ and L is the number of
classes. Therefore, the combination of classifiers by the first-order dependency
is to determine a hypothesized class m which maximizes the supported belief
function $Bel(m)$ in Eq. (16).

After an optimal product set by the second-order dependency is found and all
unknown permutations are determined, a Bayesian combination rule is also derived
from the Bayesian formalism and the optimal product set by the second-order
dependency. For a hypothesized class m, its supported belief function
$Bel(m)$ is defined by the following expressions using Eqs. (10) and (15):

$$Bel(m) = P(m \in M | C_1, \ldots, C_K) = \frac{P(C_1, \ldots, C_K, m)}{P(C_1, \ldots, C_K)} \approx \eta \prod_{j=1}^{K} P(C_{n_j} | C_{n_{i2(j)}}, C_{n_{i1(j)}}, m) \qquad (17)$$

with $\eta$ as a constant that ensures that $\sum_{i=1}^{L} Bel(m_i) = 1$. Therefore, the
combination of classifiers by the second-order dependency is to determine a
hypothesized class m which maximizes the supported belief function $Bel(m)$ in
Eq. (17). Depending on the belief value $Bel(m)$, we can choose a maximized
posterior probability $P^*(m \in M | C_1, \ldots, C_K)$, and then a combined decision is
determined or not, according to the decision rule $D(C)$ given below:

$$D(C) = \begin{cases} m_i, & \text{if } Bel(m_i) = \max_{m_l \in M} Bel(m_l) \\ L+1, & \text{otherwise.} \end{cases}$$
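To make the combination rule concrete, the sketch below evaluates the first-order belief of Eq. (16) from training counts and then applies the decision rule $D(C)$. It is a minimal illustration under assumptions of our own: probabilities are unsmoothed relative frequencies, the dependence structure is supplied as a hypothetical `parents` list, ties and all-zero beliefs are mapped to the reject label, and the function names are not taken from the paper.

```python
def belief(decisions, classes, samples, parents):
    """Normalised Bel(m) for one vector of classifier decisions.

    samples : list of (decision_tuple, true_class) training pairs
    parents : parents[j] is the index of the classifier that conditions
              classifier j, or None for the root (null component C_0)
    """
    scores = {}
    for m in classes:
        score = 1.0
        for j, p in enumerate(parents):
            if p is None:
                # root factor: joint relative frequency P(C_j, m)
                num = sum(1 for c, y in samples
                          if y == m and c[j] == decisions[j])
                den = len(samples)
            else:
                # P(C_j | C_p, m) as a ratio of counts
                num = sum(1 for c, y in samples
                          if y == m and c[p] == decisions[p] and c[j] == decisions[j])
                den = sum(1 for c, y in samples
                          if y == m and c[p] == decisions[p])
            score *= num / den if den else 0.0
        scores[m] = score
    total = sum(scores.values())  # equals 1 / eta when non-zero
    return {m: (s / total if total else 0.0) for m, s in scores.items()}


def decide(bel, reject_label):
    """Class with the unique largest belief, otherwise the reject label (L+1)."""
    best = max(bel.values())
    winners = [m for m, b in bel.items() if b == best]
    return winners[0] if best > 0 and len(winners) == 1 else reject_label
```

A decision vector never seen in training yields zero belief for every class and is therefore rejected, which mirrors the role of the label L+1 in $D(C)$.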



4 Experimental Results 

The six classifiers, E1, E2, E3, E4, E5, E6, are used for the recognition
experiments on the unconstrained handwritten numerals from Concordia University
and the University of California, Irvine (UCI). These classifiers were developed
by using the features in [10, 11] or by using the structural knowledge of numerals,
such as bounding, centroid, and the width of horizontal runs or strokes, at
KAIST and Chonbuk National University. Some of them are back-propagation
singular or modular neural networks and the others are rule-based modular
classifiers. Classifiers E2 and Eq have a modular architecture and use directional
distance distribution and mesh features, respectively. Classifiers E1 and Eq have
a singular architecture and use pixel distance function and contour features,
respectively. Since classifiers E4 and E5 were built from the structural knowledge
obtained from Concordia numerals, they perform relatively poorly on UCI numerals
owing to high rejection rates. The details on the handwritten numeral databases
are in [8] and [9]. The Concordia numerals consist of two training data sets A, B
and one test data set T. Each data set has 200 digits per class. The UCI samples
consist of one training data set tra, one cross-validation data set cv, one
writer-dependent test data set wdep, and one writer-independent test data set windep.
Each data set has a variable number of digits per class. While the data sets cv,
wdep have about 95 digits per class, the data sets tra, windep have about 185
digits per class. The neural network classifiers were trained with the data sets
A, tra. For an optimal product set in the DBPA scheme, all the data sets except T
and windep were used in each set of numerals. The performance of individual
classifiers is shown in Table 1 with recognition and reliability rates for the
respective test data sets T and windep.

Table 1. Performance of individual classifiers

             Concordia T                        UCI windep
Classifier   Recognition(%)   Reliability(%)   Recognition(%)   Reliability(%)
E1           96.00            96.00            93.77            93.77
E2           95.95            95.95            97.11            97.11
E3           84.45            96.24            91.82            96.95
E4           90.95            99.02            67.67            93.11
E5           88.15            98.38            70.01            94.80
E6           94.15            94.15            96.66            96.66



The classifiers were combined on the test data sets T and windep, using the
following combination methods: the BKS method in [7] and the several Bayesian
combination methods noted in Table 2. Among the Bayesian combination
methods, the CIAB method was proposed in [1]; the ODB1, CODB1, and ODB2
methods were proposed in [3]; the CODB2 and ODB3 methods were proposed
in [4]; and the DODB1 and DODB2 methods are proposed in this paper by using
the Δ first-order and Δ second-order C-D mutual information, respectively.



Table 2. Bayesian combination methods

Notation   Full Term
CIAB       conditional independence assumption-based method
ODB1       first-order dependency-based method
CODB1      conditional 1st-order dependency-based method
ODB2       2nd-order dependency-based method
CODB2      conditional 2nd-order dependency-based method
ODB3       3rd-order dependency-based method
DODB1      Δ 1st-order dependency-based method
DODB2      Δ 2nd-order dependency-based method



The first experiment combines five classifiers selected from the six
classifiers, so six groups were made, from 5G1 to 5G6. The best recognition rate
in each group in Tables 3 and 4 was mainly obtained by the DODB1 or DODB2
method.

Table 3. Results of five classifiers on Concordia data set: T

Group   CIAB    ODB1    CODB1   ODB2    CODB2   ODB3    DODB1   DODB2   BKS
5G1     97.40   96.95   97.70   97.65   97.15   97.15   97.90   97.80   93.25
5G2     97.20   97.20   97.95   97.80   97.70   97.25   97.95   97.90   92.35
5G3     96.80   96.80   97.80   97.50   97.45   97.45   97.65   97.80   92.30
5G4     97.45   97.10   97.85   97.35   97.95   97.95   98.15   97.95   93.20
5G5     97.35   97.00   98.25   97.95   97.55   97.55   98.25   97.90   92.90
5G6     97.60   96.85   97.60   95.85   97.40   97.40   98.10   97.95   92.85



Table 4. Results of five classifiers on UCI data set: windep

Group   CIAB    ODB1    CODB1   ODB2    CODB2   ODB3    DODB1   DODB2   BKS
5G1     97.16   97.44   98.00   98.00   96.38   98.00   98.05   98.27   93.16
5G2     97.05   97.05   97.61   97.77   97.44   97.61   97.89   98.05   93.21
5G3     97.22   97.22   97.77   98.11   98.00   97.16   98.44   98.11   93.60
5G4     96.99   96.83   97.83   97.33   96.66   96.66   98.27   98.33   93.99
5G5     96.99   97.05   97.61   97.61   97.72   97.72   97.55   97.72   93.04
5G6     97.33   97.44   97.38   98.11   98.22   96.99   98.27   98.22   93.88



The second experiment combines all six classifiers, so two groups were
made according to the source of test data, i.e. 6G1 is for Concordia University
and 6G2 for UCI. The best recognition rate in each group in Table 5 was obtained
by the DODB1 or DODB2 method.



Table 5. Results of six classifiers on data sets T and windep

Group   CIAB    ODB1    CODB1   ODB2    CODB2   ODB3    DODB1   DODB2   BKS
6G1     97.95   97.20   98.00   97.80   97.90   97.90   98.05   98.10   91.40
6G2     97.55   97.22   97.77   98.05   98.00   96.99   98.22   98.11   91.93



The experimental results support the claim that the proposed DBPAs with Bayes
error rate and their combination methods, DODB1 and DODB2, improve the
performance over the other Bayesian combination methods, according to a t-test
at significance level 0.1 for the groups of five classifiers, although they require
more storage than the previous Bayesian methods for computing the Δ
nth-order C-D mutual information. In particular, the low recognition rates of the
BKS method were caused by the lack of sufficiently large and representative
training data sets.

5 Concluding Remarks 

This paper extended the work of Wang and Wong to the second-order dependency
and reviewed the dependency between class and decisions with the defined
C-D mutual information and the BKS method; accordingly, an algorithm using
the Δ second-order C-D mutual information was also described for the second-order
dependency. In order to raise the class discrimination power in combining
multiple classifiers, the C-D mutual information was defined for the dependency
between class and decisions, and the product approximation was found so that
the upper bound of Bayes error rate is minimized. Thus the best recognition
rates were obtained with the proposed DBPAs, and the superiority of their
associated combination methods over other methods was statistically significant.

Abstract. We address a one-class classification (OCC) problem aiming
at detection of objects that come from a pre-defined target class. Since 
the non-target class is ill-defined, an effective set of features discriminat- 
ing between the targets and non-targets is hard to obtain. Alternatively, 
when raw data are available, dissimilarity representations describing an 
object by its dissimilarities to a set of target examples can be used. 

A complex problem can be approached by fusing information from a 
number of such dissimilarity representations. Therefore, we study both 
the combined dissimilarity representations (on which a single OCC is 
trained) as well as fixed and trained combiners applied to the outputs 
of the base OCCs, trained on each representation separately. An experi- 
ment focusing on the detection of diseased mucosa in oral cavity is con- 
ducted for this purpose. Our results show that both approaches allow for 
a significant improvement in performance over the best results achieved 
by the OCCs trained on single representations, however, concerning the 
computational cost, the use of combined representations might be more 
advantageous. 



1 Introduction 

Novelty detection problems arise in applications, where anomalies or outliers 
should be recognized. Given training examples, the goal is to describe the so- 
called target class such that resembling objects are accepted as targets and 
outliers (non-targets) are rejected. Such a detection has to be performed in 
an unknown or ill-defined context of alternative phenomena. Examples refer to 
health diagnostics, machine condition monitoring, industrial inspection or face 
detection. The target class is assumed to be well sampled and well defined. The 
alternative non-target (outlier) set is usually ill-defined: it is badly sampled (even 
not present at all) with unknown and hard to predict priors. If available, such 
non-targets might be structured in ways not represented in the training set. For 
such types of problems one-class classifiers (OCCs) may be very suitable [15, 
10], as they are domain or boundary descriptors. 

Since the non-target class is ill-defined, in complex problems, an effective 
set of features discriminating between targets and non-targets cannot be easily
found. Hence, it seems appropriate to build a representation on the raw data. 



F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 122—133, 2004. 
© Springer-Verlag Berlin Heidelberg 2004

The dissimilarity representation, describing objects by their dissimilarities to the
target examples, may be effective for such problems since it naturally protects 
the target class against unseen novel examples. Therefore, we will study dissim- 
ilarity representations to train one-class classifiers. Optimal representations and 
dissimilarity measures cannot be found if one of the classes is missing or badly 
sampled. On the other hand, when one analyzes a particular phenomenon, the 
model knowledge can be captured by various dissimilarity representations de- 
scribing different problem characteristics. In this way, a problem is tackled from 
a wider perspective: each additional representation may incorporate useful in- 
formation. Combining OCCs becomes, thereby, a natural technique needed for 
solving ill-defined (unbalanced) detection problems. Note, however, that stan- 
dard two-class classifiers should be preferred if the non-target class is well rep- 
resented. 

Although such problems are often met in practice, representative standard 
datasets do not exist yet. They should be based on the raw data and various 
dissimilarity measures should be available. Our procedures here are not intended 
for general multi-class problems for which other, more suitable, techniques exist. 
Our methodology is applicable to difficult problems where the target examples 
are provided with or without additional outlier examples. For that reason, the 
effectiveness of the proposed procedures is illustrated with just a single, yet 
complex, application, i.e. the detection of diseased mucosa in oral cavity. 

Two approaches are compared within this application. The first one focuses 
on combining dissimilarity representations into a single one, while the second 
approach considers a combiner operating on the outputs of the OCCs. This 
study extends results of our earlier work [9] devoted to usual classification tasks. 
Note, however, that OCCs do not directly estimate the posterior probabilities 
since they rely on information on a target class. OCCs output a sort of a signed 
distance to the boundary. 

2 One-Class Classifiers for Dissimilarity Representations 

Consider a representation set $R = \{p_1, p_2, \ldots, p_n\}$, which is a set of
representative objects. $d(x, p_i)$ denotes a dissimilarity between the objects $x$ and $p_i$,
independently from their initial representations. In general, we do not require
metric properties of $d$, since non-metric dissimilarities may arise when shapes
or objects in images are compared; see e.g. [6]. The usefulness of $d$ is judged by
its construction and its fit to the problem; $d$ should be relatively small for objects
resembling each other in reality and large for objects that differ. Obviously,
non-negativity and reflexivity, i.e. $d(x, y) \ge 0$ and $d(x, x) = 0$, are taken for
granted. Thereby, a dissimilarity representation (DR) of an object $x$ is expressed
as a vector $D(x, R) = [d(x, p_1), d(x, p_2), \ldots, d(x, p_n)]$. For a collection of training
objects $T = \{t_1, \ldots, t_N\}$, it extends to an $N \times n$ dissimilarity matrix $D(T, R)$.

In general, R might be a subset of T ($R \subset T$) or they might be distinct sets.
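The construction of $D(x, R)$ and $D(T, R)$ can be sketched in a few lines. The Euclidean distance used as the default dissimilarity is only an assumption for illustration; any non-negative, reflexive measure $d$ could be passed instead, and the function names are hypothetical.

```python
import math


def dissimilarity_vector(x, R, d=math.dist):
    """D(x, R): dissimilarities of one object to every representation object."""
    return [d(x, p) for p in R]


def dissimilarity_matrix(T, R, d=math.dist):
    """D(T, R): an N x n matrix with one row per training object."""
    return [dissimilarity_vector(t, R, d) for t in T]
```

For three training objects and a representation set made of the first two of them, the result is a 3 x 2 matrix whose entries for identical objects are zero.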

There are three principal learning approaches, referring to three interpretations
of DRs, for which a particular methodology can be adapted. In the
pretopological approach (I), the dissimilarity values are interpreted directly, hence
they can be characterized in pretopological spaces [7, 12], where the neighborhoods
play a significant role. The embedding approach (II) builds on a spatial
representation, i.e. an embedded pseudo-Euclidean configuration such that the
dissimilarities are preserved [5, 8]. In the dissimilarity space approach (III), one
considers $D(x, R): X \to \mathbb{R}^n$ as a data-dependent mapping to the so-called
dissimilarity space. In such a space, every dimension corresponds to a dissimilarity
$D(\cdot, p_i)$ to a particular object $p_i \in R$. So, the dimensions convey a homogeneous
type of information. The property that dissimilarities should be small for similar
objects (belonging to the same class) and large for distinct objects gives a
possibility for discrimination. Thereby, $D(\cdot, p_i)$ can be interpreted as an attribute.

Below, some exemplar one-class classifiers are described, which in practice
rely on some proximity $f_{prox}(x, \omega_T)$ of an object $x$ to the target class $\omega_T$.
To decide whether an object belongs to the target class or not, a
threshold $\gamma$ on $f_{prox}$ should be determined. A standard way is to supply a fraction
$r_{fn}$ of (training) target objects to be rejected by the OCC (a false negative ratio)
[14, 13]. This means that $\gamma$ is set such that $\int I(f_{prox}(x, \omega_T) > \gamma)\, d\mu(x) = r_{fn}$,
where $I$ is the indicator function and $\mu$ is some measure. $r_{fn}$ is a small value
to prevent a high acceptance of outliers as targets. In other cases, $\gamma$ can be
determined as the $(1 - r_{thr})$-percentile of the sorted sequence of the proximity
outputs computed for the training (target) examples; $r_{thr}$ is then a user-specified
fraction. Unless stated otherwise, $R \subseteq T$ consists of the target examples only.

I. Neighborhood-Based OCC. The nearest-neighbor data description
(NNDD) is realized by the classifier $C_{NNDD}$ indirectly built in a pretopological
space. The proximity function relies on the nearest neighbor dissimilarities.
For n target training objects $t_i$, a vector of averaged nearest neighbor
dissimilarities $d_{nn}(t_i, R) = \frac{1}{k} \sum_{j=1}^{k} d(t_i, p_j^i)$, where $p_j^i$ is the j-th nearest neighbor of $t_i$
in R, is obtained. Then, a threshold $\gamma$ is determined based on the $(1 - r_{thr})$-th
percentile of the sorted sequence of $d_{nn}$. The classifier then becomes:

$$C_{NNDD}(D(x, R)) = I(d_{nn}(x, R) \le \gamma) = I\Bigl(\frac{1}{k} \sum_{i=1}^{k} d(x, p_i^x) \le \gamma\Bigr), \quad p_i^x \in R. \qquad (1)$$
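A minimal sketch of the NNDD with a percentile threshold follows. Several choices here are assumptions rather than part of the description above: the Euclidean distance stands in for $d$, each training object is excluded from its own neighbor list when the threshold is computed, the percentile is taken by simple rank selection, and the function names are hypothetical.

```python
import math


def knn_average(x, R, k, d=math.dist, skip_self=False):
    """d_nn(x, R): mean dissimilarity of x to its k nearest neighbours in R."""
    dists = sorted(d(x, p) for p in R if not (skip_self and p is x))
    return sum(dists[:k]) / k


def fit_nndd(R, k=1, r_thr=0.1, d=math.dist):
    """Return an accept(x) function implementing I(d_nn(x, R) <= gamma)."""
    # Proximity outputs of the training targets, each scored without itself.
    scores = sorted(knn_average(t, R, k, d, skip_self=True) for t in R)
    # (1 - r_thr)-th percentile by rank selection.
    rank = math.ceil((1 - r_thr) * len(scores)) - 1
    gamma = scores[min(len(scores) - 1, max(0, rank))]
    return lambda x: knn_average(x, R, k, d) <= gamma
```

An object close to the target examples is accepted, while an object far from all of them exceeds the threshold and is rejected as an outlier.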

II. Generalized Mean-Class OCC (GMDD). Assume a symmetric representation
$D(R, R)$, where R consists of the targets only. Any such matrix D can
be embedded in a pseudo-Euclidean space $\mathcal{E}$ ¹ such that the given dissimilarities
are preserved perfectly [5, 8, 7]. ($\mathcal{E}$ becomes Euclidean iff D is Euclidean.)
If $D(T, R)$, $R \subset T$, is given, then $\mathcal{E}$ is determined by $D(R, R)$ and the
remaining $T \setminus R$ objects are then projected to $\mathcal{E}$. In the embedded space
$\mathcal{E}$, a simple OCC can be designed relying on the distance to the mean vector
of the target class. This can be, however, carried out without performing the
exact embedding. It can be proved
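¹ A pseudo-Euclidean space $\mathcal{E} := \mathbb{R}^{(p,q)}$ is a non-degenerate indefinite inner product
space such that the inner product $\langle \cdot, \cdot \rangle_{\mathcal{E}}$ is positive definite on $\mathbb{R}^p$ and negative
definite on $\mathbb{R}^q$. $\langle x, y \rangle_{\mathcal{E}} = \sum_{i=1}^{p} x_i y_i - \sum_{i=p+1}^{p+q} x_i y_i$ and
$d_{\mathcal{E}}^2(x, y) = \|x - y\|_{\mathcal{E}}^2 = \langle x - y, x - y \rangle_{\mathcal{E}} = d_{\mathbb{R}^p}^2(x, y) - d_{\mathbb{R}^q}^2(x, y)$.
Since $\mathcal{E}$ is a linear space, many properties based
on inner products can be appropriately extended from the Euclidean case.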
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that the proximity function fprox{x,ujT) = ||®£ ~ in £ (where xg is the 

projection of D{x,R) to £) is equivalently computed by the use of square dis- 
similarities as /prox(x, wr) = ^ Er=i <P{x,pi) - ^ YTi=i E”=i <P{Pi.Pj)i see [8, 
9, 7] for details. Then, a threshold 7 is determined as the (1— rthr)-th percentile of 
the sorted sequence of /prox(^i, wt). The generalized mean-class data description 
(GMDD) becomes then: 



- - !L It 

Cgmdd{D{x, R)) = I{— ^ (f{x,pi) — EE d^(pi,Pj) <l)- (2) 

i—1 i—1 j — 1 
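
The claim that the distance to the class mean can be obtained from square dissimilarities alone can be checked numerically in the Euclidean special case. The sketch below is illustrative only: the function names are hypothetical, and the formula used is the mean of the squared dissimilarities to $R$ minus the sum of all pairwise squared dissimilarities within $R$ divided by $2n^2$.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gmdd_proximity(sq_dis_to_r, sq_dis_rr):
    """Squared distance to the mean of R computed from squared dissimilarities only.

    sq_dis_to_r: squared dissimilarities of one object to the n objects of R.
    sq_dis_rr:   full n x n matrix of squared dissimilarities within R.
    """
    n = len(sq_dis_to_r)
    mean_to_targets = sum(sq_dis_to_r) / n
    mean_within = sum(sum(row) for row in sq_dis_rr) / (2.0 * n * n)
    return mean_to_targets - mean_within


# Euclidean check: R = {(0,0), (2,0), (0,2)} has mean (2/3, 2/3).
R = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
x = (1.0, 1.0)
to_r = [squared_euclidean(x, p) for p in R]
within = [[squared_euclidean(p, q) for q in R] for p in R]
proximity = gmdd_proximity(to_r, within)  # squared distance from x to the mean of R
```

For non-Euclidean dissimilarities the same expression is evaluated, but the result may then be negative, which reflects the indefinite inner product of the pseudo-Euclidean space.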

III. Linear Programming Dissimilarity-Data Description (LPDD). This 
OCC was proposed by us in [9]. It is designed as a hyperplane $H: w^T D(x,R) = \rho$ 
in a dissimilarity space that bounds the target data from above (we assume that 
$d$ is bounded) and which is attracted towards the origin. Non-negative dissimilarities 
impose both $\rho \ge 0$ and $w_i \ge 0$. This is achieved by minimizing $\rho / \|w\|_1$, 
which is the max-norm distance of the hyperplane $H$ to the origin in the dissimilarity 
space. Hence, $H$ can be determined by minimizing $\rho - \|w\|_1$. Assuming 
that $\|w\|_1 = 1$ (to avoid any arbitrary scaling of $w$), $H$ is found by the minimization 
of $\rho$ only. A target class is then characterized by a linear proximity 
function on dissimilarities with the weights $w$ and the threshold $\rho$. The LPDD 
is then defined as: 

$$C_{LPDD}(D(x,R)) = I\Big(\sum_{w_j \neq 0} w_j\, d(x,p_j) \le \rho\Big), \qquad (3)$$

where $w_j$ are found as the solution to a soft-margin linear programming formulation 
(the hard-margin case is then straightforward) with $\nu \in (0,1]$ being 
the upper bound on the target rejection fraction in training (here $\nu := r_{fn}$ is 
used) [9]: 

$$\min\ \rho + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad w^T D(p_i,R) \le \rho + \xi_i,\ \ \sum_j w_j = 1,\ \ w_j \ge 0,\ \ \rho \ge 0,\ \ \xi_i \ge 0,\ \ i = 1,2,\ldots,N.$$

As a result, sparse solutions are obtained, i.e. only some $w_j$ are non-zero. Objects 
of $R$ corresponding to such non-zero weights are called support objects 
(SO). The LPDD can be extended to handle example outliers as well. A label 
variable $y_i \in \{+1,-1\}$ is used to encode the targets ($+1$) and outliers ($-1$). 
The formulation above remains the same, but the main constraint changes to 
$y_i\, w^T D(p_i,R) \le y_i\,\rho + \xi_i$. 



2.1 How Good Is an OCC? 

To study the behavior of an OCC, the ROC curve [2, 14] is often used. It is a 
function of the true positive (target acceptance) versus the false positive (outlier 
acceptance) ratio. Example outliers are necessary for its evaluation. In principle, 
an OCC is trained with a fixed target rejection ratio rfn for which the threshold 
is determined. This OCC is then optimized for one point on the ROC curve. 
Fig. 1. A ROC curve for the LPDD. 



To compare the performance of various classifiers, the AUC measure is used [1]. 
It computes the Area Under the Curve, which is the OCC's performance 
integrated over all the thresholds. The larger the AUC, the better the OCC; e.g. 
in Fig. 1, the solid curve (the LPDD trained using outliers) indicates a better 
performance than the dashed curve (the LPDD trained on the targets only). The 
stars indicate the points for which the thresholds were optimized. 
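
For a concrete view of the AUC, the sketch below uses the standard equivalence between the area under the ROC curve and the fraction of (target, outlier) pairs that are ranked correctly. It is an illustration, not the evaluation code used for the experiments; the function name is hypothetical, and it is assumed that a smaller proximity value means a more target-like object.

```python
def auc_from_proximities(target_prox, outlier_prox):
    """AUC as the fraction of target/outlier pairs ordered correctly.

    A pair is correct when the target has the smaller proximity value;
    ties count as half a correct pair.
    """
    if not target_prox or not outlier_prox:
        raise ValueError("both target and outlier examples are required")
    wins = 0.0
    for t in target_prox:
        for o in outlier_prox:
            if t < o:
                wins += 1.0
            elif t == o:
                wins += 0.5
    return wins / (len(target_prox) * len(outlier_prox))
```

The double loop is quadratic in the number of examples, which is adequate for test sets of a few hundred objects; a rank-based computation would be preferable for larger sets.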

2.2 Combined Representations 

Learning from distinct DRs can be realized by combining them into a new representation 
and then training a single OCC. As a result, a more powerful representation 
may be obtained, allowing for a better discrimination. Suppose that $K$ 
representations $D^{(r)}(T,R)$, $r = 1, 2, \ldots, K$, all based on the same $R$, are given. 

Assume that the dissimilarity measures are similarly bounded (if not, they can 
be scaled appropriately), since only then can their values be related 
to each other (otherwise we would need to compare not the direct values but 
the corresponding percentiles). The DRs can be combined, for instance, in the 
following ways: 

$D_{comb}$   Expression
Avr          $D_{avr}(t_i,p_j) = \frac{1}{K}\sum_{r=1}^{K} D^{(r)}(t_i,p_j)$
Prod         $D_{prod}(t_i,p_j) = \sum_{r=1}^{K} \log(1 + D^{(r)}(t_i,p_j))$
Min          $D_{min}(t_i,p_j) = \min_r D^{(r)}(t_i,p_j)$
Max          $D_{max}(t_i,p_j) = \max_r D^{(r)}(t_i,p_j)$

The DRs are combined into one representation by using fixed rules of the kind 
usually applied when outputs of two-class classifiers are combined. Note that 
a DR can be interpreted as a collection of weak classifiers, where each of them 
is understood as a dissimilarity to a particular object $p_i$. In contrast 
to probabilities, a small dissimilarity value $D^{(r)}(t_j,p_i)$ is evidence of a good 
'performance', indicating here that the object $t_j$ is similar to the target $p_i$. In 
general, different dissimilarity measures focus on different aspects of the data. 
Hence, each of them estimates a proximity of an object $x$ to the target $p_i$ as 
$D^{(r)}(x,p_i)$. So, $D_{avr}$ yields an average proximity estimator. When the dissimilarity 
measures are independent (e.g. one built on statistical and the other on 
structural object properties), the product combiner can be of interest. Logically, 
both $D_{avr}$ and $D_{prod}$ should integrate the strengths of various representations. 
Here, $D_{prod}$ is expressed such that very small numbers are avoided (they could 
arise when multiplying close-to-zero dissimilarities). The min operator chooses 
the minimal dissimilarity value $D^{(r)}(x,p_i)$, $r = 1, \ldots, K$, hence the maximal evidence 
for an object $x$ resembling the target $p_i$. The max operator works the 
other way around. 
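
The element-wise combining rules can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, the matrices are plain nested lists of equal shape, and the product rule is taken to be the sum of $\log(1 + D^{(r)})$ terms, which is one reading of the log-based expression that avoids very small numbers.

```python
import math


def scale_by_max(dr):
    """Scale a dissimilarity matrix by its maximal value (assumed positive)."""
    largest = max(max(row) for row in dr)
    return [[value / largest for value in row] for row in dr]


def combine_drs(drs, rule):
    """Element-wise combination of K equally shaped dissimilarity matrices."""
    rules = {
        "avr": lambda values: sum(values) / len(values),
        # Assumed log form of the product rule: sum of log(1 + d).
        "prod": lambda values: sum(math.log1p(v) for v in values),
        "min": min,
        "max": max,
    }
    combine = rules[rule]
    n_rows, n_cols = len(drs[0]), len(drs[0][0])
    return [
        [combine([dr[i][j] for dr in drs]) for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

A single OCC is then trained on the combined matrix exactly as it would be on any single representation.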

2.3 Combined Classifiers 

One usually combines classifiers based on their posterior probabilities. The outputs 
of the OCCs may be converted to estimates of probabilities [14], and standard 
fixed combiners, such as mean, product and majority voting, can be considered. 
Here, we would also like to proceed with the exact OCC outputs. For this reason, 
we focus on the LPDDs. Each LPDD is determined by a hyperplane in the 
dissimilarity space $D^{(r)}(T,R)$. The distances to the hyperplane are realized by 
weighted linear combinations of the form $d_H^{(r)}(x) = \sum_j w_j^{(r)} D^{(r)}(x,p_j) - \rho^{(r)}$. 

As a result, one may construct an $n \times K$ dissimilarity matrix $D_H = [d_H^{(1)}(T), 
\ldots, d_H^{(K)}(T)]$ expressing the non-normalized signed distances between the $n$ training 
objects and $K$ 'base' classifiers. Hence, again an OCC can be trained on $D_H$. 
This means that an OCC now becomes a trained combiner, re-trained by using 
the same training set (ideally, an additional validation set should be used). The 
LPDD can be used again, as well as some other feature-based OCCs. (Although 
the values of $D_H$ become negative for the targets and positive for the outliers, 
they are bounded, so the LPDD can be constructed.) Additionally, two other 
standard data descriptions (OCCs) are used, where a proximity of an object 
to the target class relies on the $k$-means information or density estimation by 
the Parzen kernels [13], respectively (the appropriate thresholds are set up as 
described in Section 2). 
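
The input of a trained combiner can be assembled as sketched below. The names are hypothetical and the data layout is an assumption: each base LPDD is given as a pair of weights and threshold, and `representations[r][i]` holds the dissimilarities of training object `i` to the objects of $R$ under the `r`-th measure.

```python
def signed_distance(weights, rho, dissimilarities):
    """Non-normalized signed distance of one object to an LPDD hyperplane."""
    return sum(w * d for w, d in zip(weights, dissimilarities)) - rho


def build_combiner_inputs(base_classifiers, representations):
    """Return the n x K matrix of signed distances to K base hyperplanes.

    base_classifiers: list of (weights, rho) pairs, one per representation.
    representations:  list of K matrices, each with n rows.
    """
    n_objects = len(representations[0])
    return [
        [
            signed_distance(weights, rho, representations[r][i])
            for r, (weights, rho) in enumerate(base_classifiers)
        ]
        for i in range(n_objects)
    ]
```

Negative entries correspond to objects on the target side of a base hyperplane, so a second-stage OCC trained on this matrix sees targets clustered at small values.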

3 Experiments and Results 

The data consist of autofluorescence spectra acquired from healthy (target) and 
diseased (outlier) mucosa in the oral cavity [11,16]. The measurements were 
taken at 11 different anatomical locations using six excitation wavelengths: 365, 
385, 405, 420, 435 and 450 nm. We will denote them by $v_1, \ldots, v_6$. After preprocessing 
[16], each spectrum consists of 199 bins. In total, 856 and 132 spectra 
representing healthy and diseased tissue, respectively, are given for each wavelength. 
The spectra are normalized to have a unit area; see also Fig. 2. Two cases 
are investigated here: combining various DRs for a fixed wavelength of 365 nm 
(experiment I) and combining representations derived for all the wavelengths 
(experiment II). 

Fig. 2. Examples of normalized autofluorescence spectra for healthy (left) and diseased 
(right) patients for the excitation wavelength of 365 nm. 

The objects are randomly split 30 times into the training set $T$ and the test 
set $S$ in the ratio of 60% : 40%, respectively. $R$, $R \subseteq T$, consists of the targets 
only, while $T$ contains additional outliers. $|R| = 514$, $|T| = 594$ and $|S| = 394$ 
(337/57 healthy/diseased patients). If an OCC cannot use outlier information 
in the training stage, then it relies on $D^{(r)}(R,R)$ only. In the testing stage, 
$D^{(r)}(S,R)$ are used. Since we want to combine the representations directly, they 
should have a similar range. This is achieved by scaling all the initial $D^{(r)}$ by 
the maximal value of $D^{(r)}$ determined on the training data. So, further on, $D^{(r)}$ 
are assumed to be scaled appropriately. The LPDD is trained with $\nu = 0.05$ and 
the 3-NNDD and the GMDD use the threshold of 0.05. If the LPDD is trained 
using outlier information, it is denoted as $C^{out}_{LPDD}$, otherwise as $C_{LPDD}$. Trained 
combiners use the zero threshold. All the experiments are done using DD-Tools 
[13] and PRTools [3]. 

Five dissimilarity representations $D^{(1)}$ to $D^{(5)}$ are considered for the normalized 
spectra in experiment I (wavelength 365 nm). The first three DRs are based 
on the $\ell_1$ (city block) distances computed between the smoothed spectra themselves 
($D^{(1)}$) and their first and second order Gaussian-smoothed ($\sigma = 3$ 
samples) derivatives ($D^{(2)}$ and $D^{(3)}$, respectively). The zero-crossings of the 
derivatives indicate the peaks and valleys of the spectra, so they are informative. 
The differences between the spectra focus on the overlap, the differences in first 
derivatives emphasize the locations of peaks and valleys, while the differences in 
second derivatives indicate the tempo of changes in spectra. $D^{(4)}$ is based on the 
spherical geodesic distance $d^{(4)}(x,y) = \arccos(x^T y)$. $D^{(5)}$ is based on the 
Bhattacharyya distance, a divergence measure between two probability distributions. 
This measure is applicable, since the normalized spectra, say $s_i$, can be 
considered as unidimensional histogram-like distributions. They are constant on 
disjoint intervals $I_1, \ldots, I_n$, such that $s_i(x) = \sum_{z=1}^{n} h_z^i\, I(x \in I_z)$, where $h_z^i \ge 0$. 

The Bhattacharyya distance [4] is then: $d^{(5)}(s_i,s_j) = -\log\Big(\sum_{z=1}^{n} \sqrt{h_z^i h_z^j}\,|I_z|\Big)$. 
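
Two of the measures above are simple enough to sketch for histogram-like spectra. The function names are hypothetical, the bin heights are assumed to be non-negative, and the bin widths are passed explicitly so that the spectra integrate to one.

```python
import math


def city_block(s1, s2):
    """City block (l1) distance between two equally long spectra."""
    return sum(abs(a - b) for a, b in zip(s1, s2))


def bhattacharyya(h1, h2, widths):
    """Bhattacharyya distance between two piecewise-constant densities.

    h1, h2: bin heights of the two spectra; widths: lengths of the bins.
    Returns infinity when the spectra do not overlap at all.
    """
    coefficient = sum(math.sqrt(a * b) * w for a, b, w in zip(h1, h2, widths))
    if coefficient <= 0.0:
        return math.inf
    return -math.log(coefficient)
```

Identical unit-area spectra give a coefficient of one and hence a distance of zero, while spectra with disjoint support are infinitely far apart, which is why such a measure would need bounding before it is combined with other representations.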
Table 1. Experiment I: the AUC performances (in %), averaged over 30 runs, of OCCs 
built either on the combined DRs or fixed and trained combiners applied to the OCCs' 
outputs. All DRs are considered for the excitation wavelength of 365 nm. SO denotes 
support objects. The standard deviations of the means are in parentheses. 

Single DRs: OCCs trained on a single $D^{(r)}$

DR          $C_{3-NNDD}$ (1)  $C_{GMDD}$ (2)  $C_{LPDD}$ (3)  #SO    $C^{out}_{LPDD}$ (3)  #SO
$D^{(1)}$   80.9 (0.5)        77.0 (0.6)      72.3 (0.7)       2.5   79.6 (0.5)             5.5
$D^{(2)}$   86.0 (0.4)        78.4 (0.5)      72.0 (0.7)       2.8   83.1 (0.5)             5.8
$D^{(3)}$   86.7 (0.4)        78.1 (0.6)      78.1 (0.7)       2.9   84.2 (0.5)             5.3
$D^{(4)}$   81.8 (0.5)        76.6 (0.6)      68.0 (0.9)       2.9   80.2 (0.5)             6.1
$D^{(5)}$   85.5 (0.4)        77.3 (0.5)      75.1 (0.6)       2.1   80.1 (0.5)             2.5

Combined DRs: OCCs trained on $D_{comb}$

$D_{comb}$  $C_{3-NNDD}$ (1)  $C_{GMDD}$ (2)  $C_{LPDD}$ (3)  #SO    $C^{out}_{LPDD}$ (3)  #SO
Avr         95.5 (0.2)        94.6 (0.3)      93.0 (0.3)       4.1   93.4 (0.3)             5.1
Prod        95.7 (0.2)        94.9 (0.3)      93.6 (0.3)       4.6   93.6 (0.4)             7.6
Min         85.6 (0.4)        84.6 (0.4)      84.7 (0.5)      14.6   87.1 (0.9)            15.7
Max         93.5 (0.3)        90.6 (0.4)      84.7 (0.8)       7.1   89.0 (0.6)            10.5

Fixed combiners built on the OCCs' outputs from $D^{(1)}$ to $D^{(5)}$

Combiner    $C_{3-NNDD}$ (1)  $C_{GMDD}$ (2)  $C_{LPDD}$ (3)  #SO    $C^{out}_{LPDD}$ (3)  #SO
Mean        98.0 (0.2)        94.4 (0.4)      90.7 (0.6)       -     93.8 (0.3)             -
Prod        98.0 (0.1)        81.3 (0.6)      87.8 (0.5)       -     91.1 (0.3)             -
Voting      98.3 (0.1)        95.9 (0.2)      95.5 (0.2)       -     97.0 (0.2)             -

Trained combiners built on the LPDDs' outputs

Combiner    $C_{3-NNDD}$ (1)  $C_{GMDD}$ (2)  $C_{LPDD}$ (3)  #SO    $C^{out}_{LPDD}$ (3)  #SO
LPDD        -                 -               90.1 (0.5)       4.9   95.8 (0.2)             5.0
5-means     -                 -               88.0 (0.4)       -     91.1 (0.4)             -
Parzen      -                 -               90.5 (0.4)       -     94.5 (0.3)             -



In experiment II, DRs are derived for all excitation wavelengths. The first 
three measures are used. For each measure, six DRs are combined 
over the excitation wavelengths $v_1, \ldots, v_6$ and, in the end, all 18 DRs are combined 
as well. 

Fixed combiners are also built on the outputs of single OCCs (the outputs 
need to be converted to posterior probabilities, e.g. as in [14]). Additionally, 
trained OCC combiners are constructed on the outputs of single LPDDs. The 
trained combiners are the LPDD and the $k$-means and Parzen data descriptions 
[13]. 
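
The three fixed rules can be sketched for one test object as follows, assuming each base OCC has already produced an estimate of the target probability. The function names and the acceptance level of 0.5 used for voting are illustrative assumptions, not taken from the experiments.

```python
import math


def mean_rule(target_probs):
    """Average of the target probability estimates of the base OCCs."""
    return sum(target_probs) / len(target_probs)


def product_rule(target_probs):
    """Product of the target probability estimates of the base OCCs."""
    return math.prod(target_probs)


def vote_rule(target_probs, accept_level=0.5):
    """Fraction of base OCCs that accept the object as a target."""
    votes = sum(1 for p in target_probs if p >= accept_level)
    return votes / len(target_probs)
```

Each rule returns a score in which larger values mean a more target-like object, so sweeping a threshold over the score yields the ROC curve from which the AUC is computed.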

The following observations can be made from experiment I; see Table 1. Both 
an OCC trained on the combined representations and a trained or fixed combiner 
on the OCC outputs improve the AUC performance of each single OCC 
trained on a single DR. Concerning the combined representations, the element-wise 
average and product combiners perform better than the min and max operators. 
The 3-NNDD seems to give the best results; they are somewhat better than the 
ones obtained from the GMDD and the LPDD trained on $D_{comb}(T,R)$. However, 
in the testing stage, both the 3-NNDD and the GMDD rely on computing 
dissimilarities to all 514 objects of $R$, while the LPDD is based on at most 16 
support objects (see #SO in Table 1; the SO are determined during training). 
Hence, if some outliers are available for training, the LPDD can be recommended 
from the efficiency point of view. The fixed and trained combiners on the OCC 
outputs perform well. In fact, the best overall results are reached for the fixed 
majority voting combiner. However, combiners require more computations: first, 
five OCCs are trained, one on each DR separately, and then the final combiner is 
applied. Yet, if the LPDD is used for training, then the testing stage is 
cheap: the dissimilarities to 27 objects have to be computed (sum of the support 
objects for single representations). 

Due to lack of space, only some (the best) combining techniques 
are presented in Table 2. Again, both an OCC trained on the combined representations 
(by the average and product) and a fixed or trained combiner on the OCC 
outputs significantly improve the AUC performance (by more than 10%) of each single 
OCC. By using all six wavelengths and three dissimilarity measures (18 
in total), all the combining procedures yield nearly perfect performances, i.e. 
mostly 99.5% or more. The trained combiners on the LPDD outputs are somewhat 
worse (possibly due to overtraining) than the majority voting combiner; 
however, they are similar to the results of the mean combiner. Since the spectra 
derived from various wavelengths describe different information, an OCC built 
on their combined representation allows for reaching a somewhat better AUC 
performance than an OCC built on the DR combined for a single wavelength. 
From the computational point of view, either an LPDD trained on the combined 
DR or a fixed voting combiner on the LPDD outputs should be preferred. 



4 Conclusions 

Here we study procedures for detecting one-class phenomena based on a set of 
training examples, performed in an unknown or ill-defined context of alternative 
phenomena. Since the proximity of an object to a class is essential for such 
detection, dissimilarity representations (DRs) can be used, as they 
focus on object-to-target dissimilarities. The discriminative properties of various 
representations can be enhanced by proper combining. Three different one-class 
classifiers (OCCs) are used: the NNDD (based on the nearest neighbor 
information), the GMDD (a generalized mean classifier in an underlying pseudo-Euclidean 
space) and the LPDD (a hyperplane in the corresponding dissimilarity 
space), which offers a sparse solution. 

DRs directly encode evidence for objects which lie in close or far neighborhoods 
of the target objects. Hence, they can naturally be combined (after a 
proper scaling) into one representation, e.g. by an element-wise averaging. This 
is beneficial, since ultimately only one OCC needs to be trained. From our study on 
the detection of diseased mucosa in the oral cavity, it follows that DRs combined 
by average or product have a larger discriminative power than any single one. 
We also conclude that combining information from DRs derived for spectra of 
different excitation wavelengths is somewhat more beneficial than using only 
Table 2. Experiment II: the AUC performances (in %), averaged over 30 runs. Single 
DRs: single OCCs built on DRs for six excitation wavelengths (only the worst and the 
best AUCs; #SO = 2 to 7 for the LPDD). Combined DRs: OCCs built on the $D_{comb}$ 
combined over six wavelengths for a fixed measure $D^{(r)}$. Fixed combiners: fixed rules applied 
to the outputs of the trained OCCs and trained combiners: combiners trained on 
the outputs of the LPDDs, both combined over six wavelengths. 'ALL' refers to the 
results on all 6 x 3 (six wavelengths and three measures) DRs. SO denotes support 
objects. 

Single DRs: OCCs trained on a single $D^{(r)}$ for different excitation wavelengths

                    $D^{(1)}$           $D^{(2)}$           $D^{(3)}$           ALL
$C_{3-NNDD}$        80.9 - 84.8 (0.5)   82.8 - 87.0 (0.5)   83.5 - 88.8 (0.5)   80.9 - 88.8 (0.5)
$C_{GMDD}$          77.0 - 79.4 (0.7)   77.9 - 81.7 (0.6)   75.4 - 81.6 (0.6)   75.4 - 81.7 (0.7)
$C_{LPDD}$          62.8 - 72.4 (0.8)   65.5 - 72.8 (0.8)   70.7 - 77.5 (0.8)   62.8 - 77.5 (0.8)
$C^{out}_{LPDD}$    78.3 - 81.7 (0.9)   73.5 - 83.1 (0.7)   77.7 - 83.2 (0.6)   73.5 - 83.2 (0.6)



Combined DRs: OCCs trained on Dcomb combined over v\ — vq 



Ca — NNDDj -^comb 



Avr 

Prod 



^^GMDD^ -^comb 



Avr 

Prod 



Clpdd? -Ocomb 



Avr 

Prod 



Dcom 



Avr 

Prod 



D' 



97.7 (0.2) 

97.7 (0.2) 



O'- 



97.2 (0.2) 

97.3 (0.2) 






96.6 (0.3) 
96.9 (0.2) 






96.7 (0.1) 

96.8 (0.1) 



97.6 (0.2) 

97.7 (0.2) 



97.2 (0.2) 

97.4 (0.2) 



97.1 (0.3) 

97.2 (0.3) 



97.1 (0.1) 

97.2 (0.2) 



O'-- 



96.8 (0.1) 

96.9 (0.1) 



D" 



96.0 (0.1) 

96.3 (0.1) 






95.6 (0.2) 

95.8 (0.2) 






95.6 (0.1) 

95.8 (0.1) 



ALL 



99.6 (0.0) 

99.7 (0.0) 



ALL 



99.6 (0.0) 

99.6 (0.0) 



ALL 



3.6 99.5 (0.1) 

3.7 99.6 (0.0) 



ALL 



3.6 99.5 (0.0) 
5.0 99.6 (0.1) 



Fixed combiners applied to the OCCs outputs 



Ca-NNDD outputs 




O'- 



97.8 (0.1) 

98.6 (0.1) 

97.6 (0.1) 



D' 



94.3 (0.4) 
96.0 (0.2) 
96.7 (0.2) 



98.0 (0.1) 
98.5 (0.1) 
98.7 (0.1) 



94.2 (0.3) 

96.4 (0.1) 

97.4 (0.1) 



■I 

I 



O'-' 



98.2 (0.2) 
98.6 (0.1) 
98.6 (0.1) 



O'-' 



94.3 (0.3) 
96.7 (0.1) 
97.6 (0.1) 




ALL 



99.6 (0.1) 
99.6 (0.0) 
99.8 (0.0) 



ALL 



98.3 (0.2) 
99.7 (0.0) 
99.6 (0.1) 



■ 

I 



Fixed and trained combiners applied to the Clpdd outputs 



Combiner 







92.7 (0.4) 

95.7 (0.9) 
95.7 (0.4) 



89.3 (0.4) 
92.1 (0.3) 



92.9 


(0.4) 


95.7 


(1.0) 


96.8 


(0.2) 


91.5 


(0.4) 


94.4 


(0.3) 



D''' 



91.8 (0.3) 

95.7 (0.5) 

97.9 (0.1) 



94.6 (0.2) 

94.9 (0.3) 



ALL 


94.5 


(0.2) 


98.7 


(0.6) 


99.3 


(0.1) 


96.6 


(0.3) 


98.2 


(0.1) 



Fixed and trained combiners applied to the C£pl)d outputs 



Combiner 







93.7 (0.4) 
95.4 (0.8) 
96.3 (0.4) 



95.7 (0.2) 

95.5 (0.2) 



93.6 


(0.5) 


96.2 


(0.9) 


96.8 


(0.2) 


96.5 


(0.2) 


96.8 


(0.2) 



O'" 



95.6 (0.4) 

97.2 (0.5) 
98.0 (0.1) 



95.8 (0.2) 

96.2 (0.2) 



ALL 


98.8 


(0.3) 


99.5 


(0.6) 


99.5 


(0.1) 


99.1 


(0.1) 


98.9 


(0.1) 
one fixed wavelength, yet different dissimilarity measures. In the former case, 
all the OCCs on the combined representations performed about the same, while 
in the latter case, the LPDD trained on the targets only seemed to be worse. The 
fixed OCC combiners have also been applied to the outputs of single OCCs. The 
overall best results are reached for the majority voting rule. The trained OCC 
combiners, applied to the outputs of single LPDDs, performed well, yet worse 
than the voting rule. Concerning the computational issues, either the LPDD on 
the combined representations should be used or the majority voting combiner 
applied to the LPDD outputs. 

Further studies on new problems need to be conducted in the future. 
Abstract. While the field of classification has seen considerable progress 
in recent years, little attention has been given to methods that deal with time 
series data. In this paper, we propose a modular system for the classification of 
time series data. The proposed approach exploits diversity through various 
input representation techniques, each of which focuses on a certain aspect of 
the temporal patterns. The temporal patterns are identified by aggregating the 
decisions of multiple classifiers trained on different representations of the 
input data. Several time series data sets are employed to examine the validity of 
the proposed approach. The results obtained from our experiments show that the 
proposed approach is effective as well as robust. 



1 Introduction 

Many real applications are concerned with knowledge that varies over time. 
Temporal classification has been widely adopted in areas such as climate control 
research, medical diagnosis, economic forecasting and sound recognition. However, while 
the classification of non-temporal information has seen great progress in recent years, 
temporal classification techniques are scarce. Temporal patterns contain dynamic information 
and are difficult to represent with traditional approaches. Classifiers such as 
neural networks and decision trees do not function well when applied to temporal data 
directly. Other solutions, which employ temporal models, are normally domain-dependent 
and cannot be applied to general problems. In current machine 
learning research, Multiple Classifier Systems (MCSs) have proved to be an effective 
way to improve classification performance and are widely used to achieve high 
pattern-recognition performance [9]. They have become one of the major 
focuses in this area. Generally, an MCS is composed of three modules: diversification, 
multiple decision-making and decision combination. Fig. 1 demonstrates a general 
conceptual framework for MCSs. 

Multiple decision-making is normally regarded as a process in which the classifier 
searches for correct answers in its knowledge space. The diversity module collectively 
covers the knowledge space available. In the decision combination module, multiple 
decisions are combined under different rules to make the final decisions. The use of 
MCSs may be an effective way to address the temporal classification problem. Although temporal 
patterns are often complex and difficult to identify, this task can be divided into several 
simpler sub-tasks. Therefore, if it is possible to find an ensemble of classifiers, each 
of which effectively fulfills one of the sub-tasks, the temporal patterns could be well 
identified through the aggregation of the solutions of the ensemble. 

Fig. 1. Conceptual Framework of MCSs 

F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 134-143, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 

The organization of this paper is as follows: section 2 reviews the related work. 
In section 3, we propose a modular system to achieve diversity at the input level. The 
experiments and discussion are presented in section 4. Section 5 concludes the work of 
this paper. 

2 Related Works 

The issue of diversity has long been the focus of many research works on MCSs. Here 
we review work related to temporal data classification. Diversity at the training level 
is achieved by combining an ensemble of homogeneous or heterogeneous classifiers. 
Sanchos et al. [14] employ an ensemble of five base classifiers, which are either Self-Organizing 
Maps (SOM) or Multi-Layer Perceptrons (MLP), for the classification 
of time series data. Each SOM or MLP shares the same architecture but has 
different initial weights. While it is quite innovative to combine supervised and 
unsupervised learning in this research, the approach has several weaknesses. First, the Euclidean 
distance is used to calculate the posterior probability of each class in the SOM 
classifiers. However, it is known that the Euclidean distance is not very effective for 
measuring the similarity of time-dependent data. Therefore, it is somewhat questionable 
to adopt this method to generate the posterior probability. In addition, the mapping 
between the clusters and the class labels may not be the one-to-one mapping that is 
implicitly assumed by the authors. Ghosh et al. [6, 8] employ an ensemble of Artificial 
Neural Networks (ANNs) with the weighted average fusion technique to identify and 
classify underwater acoustic signals. The feature space of the input data is composed 
of 25-dimensional feature vectors extracted from the raw signals. In each feature 
vector, there are 16 wavelet coefficients, 1 value denoting the signal duration and 8 other 
temporal descriptors and spectral measurements. One of the weaknesses of this research 
is that the feature extraction is ad hoc. The number and type of the features may 
be optimal only for certain applications such as oceanic signal data. 

The previous techniques mainly achieved the diversity of the MCSs at the training 
level by introducing different types of classifiers or different sets of parameters. 
The work of Valentini and Masulli [17] states the trade-off between accuracy and independence 
of the base classifiers. Therefore, the number of independent classifiers with 
high accuracy may be limited, which could limit the capability of MCSs to achieve 
diversity at the training level. While previous researchers were interested in combining 
different classifiers, the recent focus in this area has shifted to diversity at the 
input level. Gonzalez et al. [7] and Diez et al. [3] propose interval-based classification 
systems for time series data, in which the AdaBoost technique is adopted. The basic 
idea is that each time series is represented by a set of temporal predicates and the 
final classification results are obtained based on the combination of these predicates. 
The objective of this research, as claimed by the authors, is to find a non-domain-specific 
temporal classification technique. However, the selection and definition of the temporal 
predicates may be ad hoc in themselves. In addition, it is also difficult to decide 
on parameters such as the intervals of the variables and the size of the search window. 
Dietrich et al. [5] propose three different kinds of architectures for the fusion 
of decisions based on local features: Classification, Decision Fusion and Temporal 
Fusion (CDT); Decision Fusion, Classification and Temporal Fusion (DCT); and Classification, 
Temporal Fusion, Decision Fusion (CTD). CDT and CTD adopt a hierarchical 
fusion architecture. They differ from each other in whether they first fuse the decisions of the base 
classifiers on different types of features in the same window or on the same type of features 
over various windows. DCT first connects the $p$ features in each window into a vector. 
Then, the final decision is made based on the fusion of the decisions of the base classifiers 
on the $p$-feature vector in each window. In [4], Dietrich et al. further proposed 
three other architectures for the fusion of decisions: Multiple Decision Template (MDT), 
Decision Template (DT) and Clustered Multiple Decision Template (CMDT). MDT and 
DT are similar to CDT and DCT, respectively. CMDT improves MDT by assigning a 
set of multiple templates $\{DT_i^1, DT_i^2, \ldots, DT_i^k\}$ to each class $\omega_i$. The weaknesses of this 
research can be summarized as follows: (1) The choice of parameters such as the size 
of the sliding window is data-dependent. (2) The sliding window method may cause 
information loss for high-order temporal patterns. Hsu et al. [10] propose a hierarchical 
mixture model called the specialist-moderator network for time series classification. It 
combines recurrent ANN classifiers in a bottom-up architecture. The primary contribution 
of this work is the model's ability to identify intermediate targets based on a 
partition of the input channels. One of the common drawbacks of hierarchical classifier 
systems, such as the hierarchical mixture of experts, is that they are not robust: the failure of 
one base classifier immediately affects the performance of the whole system. 

3 Modular Approach 

The validity of the diversity is based on the following two observations: (1) the ability 
of an individual classifier to identify the discriminating patterns is limited in some 
situations; (2) the set of patterns identified by a classifier $C_i$ depends on the representation 
of the input data. The representation technique is defined as a mapping: 

$$f : D \to R \qquad (1)$$

where $D$ is the input data space and $R$ is the result data space, that is, $R_j = f_j(D)$. 
Suppose that $Pattern_i(R_j)$ represents the pattern set identified by classifier $C_i$ under 
the representation technique $f_j(D)$ and $Pattern(D)$ represents all the patterns in the 
data set $D$. Therefore, there is: 

$$L_i^j = Pattern(D) - Pattern_i(R_j) \qquad (2)$$

where $L_i^j$ is the set of the patterns hidden from classifier $C_i$ with the representation 
technique $f_j(D)$. Suppose each pattern is equally important. From equation 2, it is clear 
that the size of the set $L_i^j$ should be minimized to achieve the optimal performance. 
Let $L_C$ represent the set of the patterns hidden from the ensemble of classifiers $C = 
\{C_i\}$, in which the base classifier $C_i$ adopts a certain representation technique $R_j$ ($R = 
\{R_j\}$). Therefore, there is 

$$L_C = \bigcap_{i=1}^{L} \bigcap_{j=1}^{M} L_i^j \qquad (3)$$

$$|L_C| \le |L_i^j| \qquad (4)$$



where $i = 1..L$, $j = 1..M$. Equation 4 shows that MCSs may be superior to any combination 
of an individual classifier and a representation technique if the diversity is appropriately 
balanced. In this research, we mainly focus on examining the effects of 
the diversity achieved through different input representations. In this paper, a set of 
homogeneous classifiers $C$ is employed with various representation techniques $R = 
\{R_1, \ldots, R_M\}$. Compared to resampling [2] and input decimation [12], there is less 
information loss when processing the input information. The decomposition of the input 
space reduces the information available for the individual classifiers. Thus, the classifiers 
may not be well trained. The decomposition of the feature space may conceal some 
high-order patterns and therefore degrade the performance of the whole system. While the 
diversity through input transformation [16] overcomes the drawbacks of resampling 
and input decimation, the achieved diversity is limited. For example, the global information 
of the time series data, such as the mean and variance, is also important to the 
classification. The proposed approach allows various representations of the input besides the 
transformation and tends to have a better performance. Compared to non-temporal 
data, it is in general more difficult for temporal patterns to be represented by individual classifiers. 
Therefore, the set $L_i^j$ tends to be large and the performance of the 
individual classifiers may not be good in some situations. In the proposed approach, 
each Temporal Representation Processor (TRP) is an expert in a certain type of temporal 
information and implements a certain representation technique, $f_j(D)$. The 
complete information in the temporal patterns could be well identified by aggregation 
of the classifiers trained with different representations of the input data. Most importantly, 
this strategy doesn't assume the comprehensiveness of any TRP for classifier $C_i$. 
That is, each TRP may process only part of the temporal information. Thus, the 
complexity in the design of TRP is significantly reduced. Fig. 2 shows the architecture 
of the proposed approach. 

In the proposed approach, each TRP maps the time series data X (X = {X_t}, t = 1..T)
to another domain, R. A TRP is composed of a series of components, each of which
implements a certain algorithm to process the time series data. Several representation
techniques for time series data are discussed in the following.

Fig. 2. Proposed Architecture



- Filter

The filter is employed to remove a certain type of information from the time series
data. For example, the local fluctuations of the time series data should be removed
if the objective is to obtain the smoothed values. In contrast, the long-term
fluctuations should be removed to obtain the residuals from the smoothed values. There
are various types of filters depending on their functional requirements. Among the
popular schemes for producing a smoothed time series are the exponential smoothing
schemes, which weight past observations using exponentially decreasing weights.
The family of exponential smoothing schemes includes Single Exponential
Smoothing (SES), Double Exponential Smoothing (DES) and Triple Exponential
Smoothing (TES). These methods differ in the number of parameters used to represent
the trend and seasonal information in the time series data. SES has no parameters to
represent the trend or seasonal information. DES takes the trend information into
consideration but neglects the seasonal information. TES is the most complex: it
employs a set of parameters to represent both the trend and the seasonal information.
As an example, the function of DES is given as:

$$S_t = \alpha X_t + (1 - \alpha)(S_{t-1} + b_{t-1})$$

$$b_t = \gamma (S_t - S_{t-1}) + (1 - \gamma) b_{t-1}$$

$$b_0 = \frac{(X_2 - X_1) + (X_3 - X_2) + (X_4 - X_3)}{3}$$

where $S_t$ is the estimated (smoothed) value, $X_t$ is the input time series data, and
$\alpha$ is the smoothing rate, which lies in the interval [0,1]. When $\alpha = 1$, the
smoothed data is exactly the same as the original one. $\gamma$ is the learning rate of
the trend $b_t$. The larger the value of $\gamma$, the more the trend estimate follows
the most recent change $S_t - S_{t-1}$ and the less it depends on the previous trend
$b_{t-1}$. Obviously, different combinations of the parameters generate different series
of temporal data. DES($\alpha$, $\gamma$) stands for the algorithm with the parameters
$\alpha$ and $\gamma$.
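
The DES recursion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `double_exponential_smoothing` is hypothetical, the initial level is taken to be the first observation (the text does not specify it), and the initial trend is the average of the first three successive differences.

```python
def double_exponential_smoothing(x, alpha, gamma):
    """Return the DES-smoothed series for x (needs at least four observations)."""
    if len(x) < 4:
        raise ValueError("need at least four observations to initialise the trend")
    # Assumption: the initial level is the first observation.
    s_prev = x[0]
    # Initial trend: average of the first three successive differences.
    b_prev = ((x[1] - x[0]) + (x[2] - x[1]) + (x[3] - x[2])) / 3.0
    smoothed = [s_prev]
    for t in range(1, len(x)):
        # Level update: blend the new observation with the previous forecast.
        s_t = alpha * x[t] + (1.0 - alpha) * (s_prev + b_prev)
        # Trend update: blend the latest change in level with the previous trend.
        b_t = gamma * (s_t - s_prev) + (1.0 - gamma) * b_prev
        smoothed.append(s_t)
        s_prev, b_prev = s_t, b_t
    return smoothed
```

With alpha = 1 the function returns the input unchanged, which matches the remark about the smoothing rate. Calling it with several (alpha, gamma) pairs yields the differently smoothed views of the same series that distinct TRPs would consume.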
- Spectral Analyst

Spectral analysis is a useful exploratory diagnostic tool in the analysis of many
types of time series data. It provides information about the 'hidden periodicity'
in the data. The most widely used technique in spectral analysis is the Discrete
Fourier Transform (DFT). The effect of the transformation is to map the data
from the time domain to the frequency domain. The transformed data can be fed directly
into traditional classifiers. Other spectral analysis methods that are also used
quite often include the wavelet transform, the Hilbert transform and the short-time
Fourier transform.
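
As a rough illustration of how a spectral component could turn a series into classifier input, the sketch below computes DFT magnitudes directly from the definition. The function name `dft_magnitudes` and the choice of magnitudes (rather than real and imaginary parts) as features are assumptions made for this example; the text does not say which form the transformed data takes.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes |X_k| of the DFT of a real series x, for k = 0 .. len(x) // 2."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        # Direct O(n^2) evaluation of X_k = sum_t x_t * exp(-2*pi*i*k*t/n).
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(coeff))
    return mags
```

Only the first half of the spectrum is kept because the DFT of a real series is conjugate-symmetric. For long series a fast Fourier transform would be the practical choice; the direct sum is used here only to keep the sketch self-contained.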

- Differentiation Generator (DG)

The differentiation generator is a special type of filter which calculates the
differences within a temporal channel. A differentiation generator DG(p, q) has two
parameters, order and distance, which are represented by p and q respectively. The
first-order, distance-one differentiation generator is calculated as:

$$DG(1, 1) = X_t - X_{t-1} \qquad (5)$$

Similarly, the order-p, distance-q differentiation generator is calculated by:

$$DG(p, q) = DG(p-1, q+t) - DG(p-1, t) \qquad (6)$$

The differentiation generator has several functions. First, it can remove the
effects of the initial values. For example, the effect of a shift can be eliminated
by DG(1,1). In addition, it can remove the trend information in the time series
data by differencing the given temporal channel until it becomes stationary. This
is helpful to those TRPs which focus only on the stationary information in the time
series data.
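
A minimal sketch of a differentiation generator follows. It reads Equation 6 as "apply a lag-q difference p times", which is one plausible interpretation and not necessarily the authors' exact definition; the function name `differentiation_generator` is hypothetical.

```python
def differentiation_generator(x, p=1, q=1):
    """Apply a lag-q difference to the series x, repeated p times."""
    if p < 1 or q < 1:
        raise ValueError("order p and distance q must be positive integers")
    out = list(x)
    for _ in range(p):
        if len(out) <= q:
            raise ValueError("series too short for the requested order and distance")
        # Each pass shortens the series by q frames.
        out = [out[t] - out[t - q] for t in range(q, len(out))]
    return out
```

A constant shift added to every frame cancels in the first pass, and a linear trend becomes a constant, which illustrates the two functions of the DG mentioned in the text.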




Fig. 3. Temporal Representation Processor 

The components in TRP could be connected either sequentially or in parallel to ful- 
fill a certain task. The design of TRPs depends on the representation logic. For example, 
if the functionality of a certain TRP is to “smooth the time series data and eliminate the 
’shift’ effects”, the task is fulfilled by the sequential connection of the filter and dif- 
ferentiation generator (Fig. 3). The connection mechanism of the components further 
expands the ability of TRPs to process the temporal information. 

4 Experimental Results and Discussion 

In this section, we use several popular time series data sets to demonstrate the
feasibility of the proposed approach. These data sets were used by previous researchers
and are available from the UCI repository or related references. The characteristics of
the data sets are summarized in Table 1.

Table 1. Characteristics of the Data Sets

Data Set             Classes   Instances   Frames
CBF                  3         600         200
Control Chart (CC)   6         600         60
CC+Label Noise       6         600         60
Waveform (WF)        3         600         21
WF+Data Noise        3         600         40

- The CBF data was introduced by Saito [13] as an artificial problem. The learning task
is to distinguish the data from three classes: Cylinder (c), Bell (b) and Funnel (f).
The models of the three classes are:

$$c(t) = (6 + \eta) \cdot \chi_{[a,b]}(t) + \epsilon(t)$$
$$b(t) = (6 + \eta) \cdot \chi_{[a,b]}(t) \cdot \frac{t - a}{b - a} + \epsilon(t)$$
$$f(t) = (6 + \eta) \cdot \chi_{[a,b]}(t) \cdot \frac{b - t}{b - a} + \epsilon(t)$$

$\eta$ and $\epsilon(t)$ are obtained from the standard normal distribution N(0, 1). a is
an integer obtained from the uniform distribution in [16,32]. c is another integer
obtained from the uniform distribution in [32,96]. The integer b is the sum of a and c.
$\chi_{[a,b]}(t) = 1$ if $t \in [a, b]$; otherwise, $\chi_{[a,b]}(t) = 0$.

- The CC data(from UCI repository) contains 600 examples of control charts which 
were synthetically generated. There are six different classes of control charts: (A) 
Normal (B) Cyclic (C) Downward shift (D) Upward shift (E) Increasing trend (F) 
Decreasing trend. 

- The data set Control Chart+Label Noise (CC+LN) is synthetically produced in this
research to examine the performance of the proposed approach in the presence of
label noise in the data. This data is generated in the same way as the control chart
data except that the labels of part of the training data are randomly generated. That
is, a subset of the training data may be incorrectly labelled.

- The Waveform data was introduced by Breiman et al. [1]. The purpose is to distinguish
between three classes, defined by the evaluation in the time frames t = 1, 2, ..., 21 of
the following models:

$$x_1(t) = u h_1(t) + (1 - u) h_2(t) + \epsilon(t)$$
$$x_2(t) = u h_1(t) + (1 - u) h_3(t) + \epsilon(t)$$
$$x_3(t) = u h_2(t) + (1 - u) h_3(t) + \epsilon(t)$$

where $h_1(t) = \max(6 - |t - 7|, 0)$, $h_2(t) = h_1(t - 8)$, $h_3(t) = h_1(t - 4)$. u is a
random variable with uniform distribution in (0, 1) and $\epsilon(t)$ follows the standard
normal distribution.

- The Waveform+Data Noise (WF+DN) data [7] is generated in the same way as the previous
models except that 19 frames are added to the end of each time series. The values of
the 19 frames follow the standard normal distribution N(0, 1). This data set is used to
test the performance of the proposed approach in the presence of data noise in the
data set.
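
The synthetic CBF and Waveform models listed above can be generated with a short script. The sketch below is illustrative only: the function names, the `rng` argument and the `noise_frames` parameter are hypothetical, the series length for CBF is left to the caller, and the bell and funnel ramps use the usual (t - a)/(b - a) and (b - t)/(b - a) forms of the CBF definition.

```python
import random


def cbf_series(kind, length, rng):
    """One Cylinder-Bell-Funnel series with frames t = 1 .. length."""
    a = rng.randint(16, 32)       # onset: uniform integer in [16, 32]
    b = a + rng.randint(32, 96)   # offset: a plus a uniform integer in [32, 96]
    eta = rng.gauss(0.0, 1.0)     # amplitude noise, drawn once per series
    shapes = {
        "cylinder": lambda t: 1.0,
        "bell": lambda t: (t - a) / (b - a),
        "funnel": lambda t: (b - t) / (b - a),
    }
    shape = shapes[kind]          # raises KeyError for an unknown class name
    series = []
    for t in range(1, length + 1):
        inside = 1.0 if a <= t <= b else 0.0   # indicator of the interval [a, b]
        series.append((6.0 + eta) * inside * shape(t) + rng.gauss(0.0, 1.0))
    return series


def h1(t):
    return max(6.0 - abs(t - 7), 0.0)


def h2(t):
    return h1(t - 8)


def h3(t):
    return h1(t - 4)


def waveform_series(cls, rng, noise_frames=0):
    """One Waveform series (21 frames) plus optional trailing N(0, 1) noise frames."""
    first, second = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}[cls]
    u = rng.random()              # mixing weight, uniform on [0, 1)
    series = [u * first(t) + (1.0 - u) * second(t) + rng.gauss(0.0, 1.0)
              for t in range(1, 22)]
    series.extend(rng.gauss(0.0, 1.0) for _ in range(noise_frames))
    return series
```

Passing `random.Random(seed)` as `rng` makes a generated data set reproducible, and `waveform_series(cls, rng, noise_frames=19)` gives the 40-frame WF+DN variant.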

The experiments are designed to examine the accuracy as well as the reliability of
the proposed approach. A single classifier is employed as the benchmark to evaluate
the performance of the MCSs. Raw, DFT and DES+DG represent the single Probabilistic
Neural Network (PNN) [15] classifier with the raw time series data, the data
transformed with the DFT, and the data processed with the DES and DG, respectively.
The test results are shown in Table 2, which summarizes the Mean of Correct Ratio (MCR)
and Variance of Correct Ratio (VCR) of the different approaches over 10 consecutive
experiments. The entry is in the format of MCR%(VCRx 10“^). For the Bagging approach,
an ensemble of 12 homogeneous PNN classifiers is employed. We adopt the bootstrap
resampling technique in this experiment. Each base classifier is trained with 80% of
the training data, randomly selected from the training data set.



Table 2. Experimental Results

DataSet    CBF        CC         CC+5%      CC+10%     WF         WF+DN
Raw        72.5(1.0)  81.0(1.8)  81.8(1.4)  77.9(2.3)  93.7(0.3)  90.5(0.2)
DFT        84.8(0.5)  72.8(2.2)  73.5(1.3)  70.2(4.1)  86.1(0.2)  83.2(0.9)
DES+DG     77.0(0.4)  85.8(0.4)  84.2(0.4)  80.6(1.1)  92.2(0.4)  88.4(0.2)
Bagging    72.2(0.8)  80.6(1.7)  79.5(0.9)  80.3(1.2)  93.8(0.1)  89.4(0.2)
NTRPs-BF   71.3(0.7)  84.0(5.3)  78.0(2.1)  78.0(2.2)  91.7(0.5)  89.7(0.3)
TRPs-BF    85.5(4.0)  88.6(1.8)  88.4(1.7)  83.7(5.7)  92.9(0.3)  90.7(0.2)
TRPs-DT    91.0(0.8)  92.2(0.4)  91.3(0.8)  90.5(0.6)  92.9(0.4)  90.8(0.7)



We examine the proposed diversity approach with different fusion techniques, including
the Bayesian Fusion (TRPs-BF) and Decision Template (TRPs-DT) approaches [11]. For
these experiments, an MCS with 12 homogeneous PNN base classifiers is employed, which
are trained with different representations of the time series data. One of the TRPs is
composed of one spectral analyst which implements the DFT algorithm. Another TRP
implements a dummy function which just presents the original data to the classifier.
The other TRPs are each composed of a filter, which implements the DES, and a DG,
connected sequentially. 10 different TRPs are generated by using different parameters
of the DES. Finally, we also examine the effects of diversity at the training level.
The entry NTRPs-BF in Table 2 stands for the MCS which is combined with Bayesian fusion
but has no diversity at the input level. It employs a homogeneous set of 12 PNNs with
the raw data representation. In all of the experiments, the data sets are randomly
separated into 60% training and 40% testing. For the trained fusion techniques,
including DT and BF, 66.7% of the training data are used to train the PNN classifiers
and the remaining 33.3% are used as a validation set to estimate the decision
distribution of the base classifiers.

The experimental results are divided into three groups for discussion: without noise
(CBF, CC, WF), with label noise (CC, CC+5%, CC+10%) and with data noise (WF,
WF+Data Noise). Both the CC and WF data sets are included in two different groups to
show the trend of the performance of the various approaches in the presence of noise.
There are several observations from the performance of the individual classifier
approaches, including Raw, DFT and DES+DG. (1) The first group of experiments shows
that these techniques are data-dependent. For example, the highest MCR for Raw is
93.7% in the WF data set, while it is as low as 72.5% in CBF. (2) Different
representation techniques have strengths in different data sets. For CBF, the MCR of
DFT is 12.3 percentage points higher than that of Raw and 7.8 percentage points higher
than that of DES+DG. For CC and WF, DES+DG and Raw have the best performance
respectively. This provides a foundation for the proposed MCSs to achieve consistent
performance over various data sets. (3) Different techniques present different levels
of robustness in the presence of noise. The second group of experiments shows that
DES+DG is relatively sensitive to label noise while Raw and DFT are more robust.
Therefore, it is possible for the MCS to achieve robust performance by combining the
different methods. For the resampling technique, the performance of Bagging is similar
to that of Raw. This demonstrates that the improvement obtained with the resampling
technique on temporal classification may be limited. The most interesting comparison
is between NTRPs-BF and TRPs-BF, since these two approaches have the same
multiple-decision and decision combination modules, but TRPs-BF achieves diversity at
the input level through different representations. We found that TRPs-BF outperforms
NTRPs-BF in all three groups of experiments. In particular, the MCR of TRPs-BF is
13.8% more than that of NTRPs-BF in CBF. This shows that diversity through various
input representations can be an effective approach to temporal classification.
TRPs-DT performs better still than TRPs-BF. In particular, TRPs-DT appears to be more
robust to label and data noise. For example, when there is 10% label noise in the CC
data, its MCR drops by only 1.7 percentage points compared to the case without label
noise. In addition, TRPs-DT also significantly outperforms the other methods discussed
previously in general.

5 Conclusion 

In this paper, we propose a modular system for the classification of time series
data in which diversity through various input representations is employed. Since the
patterns identified by the individual classifiers depend on the representation of the
input data, it is possible that the performance of MCSs can be improved by aggregating
the different discriminating patterns. In particular, this approach can be effective
for temporal classification, since temporal patterns contain dynamic information and
are difficult to represent fully with an individual technique. The experimental results
show that the proposed approach significantly outperforms other methods in general.
Abstract. This paper presents a probabilistic model for combining cluster
ensembles utilizing information theoretic measures. Starting from a
co-association matrix which summarizes the ensemble, we extract a set 
of association distributions, which are modelled as discrete probability 
distributions of the object labels, conditional on each data object. The 
key objectives are, first, to model the associations of neighboring data 
objects, and second, to allow for the manipulation of the defined prob- 
ability distributions using statistical and information theoretic means. 
A Jensen-Shannon Divergence based Clustering Combination (JSDCC) 
method is proposed. The method selects cluster prototypes from the set 
of association distributions based on entropy maximization and max- 
imization of the generalized JS divergence among the selected proto- 
types. The method proceeds by grouping association distributions by 
minimizing their JS divergences to the selected prototypes. By aggre- 
gating the grouped association distributions, we can represent empirical 
cluster conditional probability distributions of the object labels, for each 
of the combined clusters. Finally, data objects are assigned to their most 
likely clusters, and their cluster assignment probabilities are estimated. 
Experiments are performed to assess the presented method and compare 
its performance with other alternative co-association based methods. 



1 Introduction 

Unsupervised classification, or data clustering is an essential tool for exploring 
and searching for groups in unlabelled data. It is a challenging problem because 
the clusters inherent in the data can be of arbitrarily different shapes and sizes. 
A large number of clustering techniques have been developed over the years [1,2]. 
However, many are limited to finding clusters of specific shapes and structures 
and may fail when the data reveal cluster shapes and structures that do not 
match their assumed model. For instance, K-means, which is one of the
simplest and most computationally efficient clustering techniques, can easily fail if
the true clusters inherent in the data are not hyper-spherically shaped.

Recently, there has been an emergent interest in studying cluster ensembles
to enhance the quality and robustness of data clustering and to accommodate a
wider variety of data types and cluster structures [3-10]. Some of this research
has relied on using a co-association matrix as a voting medium for finding the
combined partitioning. Fred and Jain [5] used single link (SLink) hierarchical
clustering to produce a final partitioning of the data. The Cluster-based Similarity
Partitioning Algorithm (CSPA) is a consensus function introduced by Strehl and
Ghosh [3] in which graph partitioning is applied to the co-association matrix,
resulting in a consensus partitioning. In [8, 9], we used the co-association matrix
to construct a Weighted Shared nearest neighbors Graph (WSnnG) and applied
a weighted version of the same graph partitioning software tool, METIS [11], used
in CSPA to find the consensus partitioning.

In this paper we present a probabilistic model derived from the co-association 
matrix, in which each object’s co-associations are modelled as a probability 
distribution, which we refer to as association distribution. The model is pre- 
sented in Section 2. A proposed information theoretic clustering combination 
process, which generates groups of association distributions is presented in Sec- 
tion 3. Groups of association distributions are aggregated through averaging to 
represent empirical cluster conditional probability distributions for each of the 
combined clusters. The model allows for cluster assignment probabilities to be 
estimated. Experimental results and analysis are presented in Section 4. Five 
datasets with various cluster structures are used to evaluate the performance of 
the proposed method at different values of the design parameters and in com- 
parison with alternative methods that operate on the co-association matrix. 

It is noted that diversity among the data clusterings of the ensemble can 
be created in a number of different ways, such as random restarts of a single 
clustering technique, or the use of different clustering techniques, or data re- 
sampling as has been reported recently in [12, 13]. In this paper, we use multiple 
K-means clusterings using random initial restarts. 

2 Probabilistic Model for Cluster Ensembles 

2.1 Problem Formulation 

Let $\{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$ be a set of n data objects described as
d-dimensional vectors of features. Let $\{x_1, \cdots, x_n\}$ denote a set of n labels
corresponding to each of the data objects. Let a cluster ensemble consist of m data
clusterings of the n d-dimensional data vectors $\{\mathbf{x}_i\}_{i=1}^{n}$. While the
clusterings are performed on the objects as vectors in their d-dimensional feature
space, the modelling and combination methods described in this work deal exclusively
with object labels, rather than objects as feature vectors. Throughout the paper, we
will use the terms objects and object labels interchangeably to refer to object labels
denoted by $x_i$ unless it is otherwise specified.

Let $\{\mathbf{y}_1, \cdots, \mathbf{y}_m\}$ be m n-dimensional labelling vectors
representing the data clusterings, where the i-th clustering of the n objects consists
of a number $k_i$ of clusters. Each entry $y_{ij}$ in the vector $\mathbf{y}_i$
represents the cluster label, i.e. the index of the cluster, to which data object $x_j$
is assigned, such that $y_{ij} \in \{1, \cdots, k_i\}$. In this paper, the m clusterings
are combined, resulting in a clustering $\mathbf{y}$ which consists of $k^*$ clusters,
where $k^*$ is prespecified. In addition, an n-dimensional probability vector is
generated which gives an estimate of the probability of association of each object to
its assigned cluster.

2.2 Association Distributions 

Let S be an n × n co-association matrix which summarizes the generated ensemble.
Each entry $S_{ij}$ of S represents the ratio of the number of times objects
$x_i$ and $x_j$ co-occur in the same cluster to the number of clusterings m.

Let X be a discrete random variable which takes as values the object labels
$\{x_1, \cdots, x_n\}$. Given an object label $x_i$, define an association distribution
$p(x|x_i)$ as a probability distribution of the random variable X. We have a total of n
association distributions, one given each object label $x_i$. The probability values
assumed by $p(x|x_i)$ are computed as follows:

$$P(X = x_j | x_i) = \frac{S_{ij}}{\sum_{l=1}^{n} S_{il}}, \quad \forall j \in \{1, \cdots, n\}$$

That is, each association distribution is simply computed by normalizing each
row/column of S. Hence, $p(x|x_i)$ satisfies the two conditions:
$P(X = x_j|x_i) \geq 0, \forall j \in \{1, \cdots, n\}$, and
$\sum_{j=1}^{n} P(X = x_j|x_i) = 1$. Each data object $x_i$ contributes
$\frac{1}{n} p(x|x_i)$ to the estimated probability distribution $p(x)$ of X, i.e.,

$$p(x) = \frac{1}{n} \sum_{i=1}^{n} p(x|x_i)$$
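
To make the construction of association distributions concrete, the following sketch builds a co-association matrix from a small ensemble of label vectors, normalizes its rows, and averages the rows into the estimate of p(x). The function names are hypothetical and plain Python lists are used only to keep the example self-contained.

```python
def co_association(labelings):
    """S[i][j] = fraction of the m clusterings in which objects i and j share a cluster."""
    m = len(labelings)
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for y in labelings:
        for i in range(n):
            for j in range(n):
                if y[i] == y[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def association_distributions(S):
    """Row-normalize S so that row i is the association distribution p(x | x_i)."""
    # Each row sum is at least S[i][i] = 1, so the division is always defined.
    return [[s / sum(row) for s in row] for row in S]


def marginal(P):
    """Estimate p(x) as the average of the n association distributions."""
    n = len(P)
    return [sum(P[i][j] for i in range(n)) / n for j in range(n)]
```

For the two clusterings [0, 0, 1, 1] and [0, 0, 0, 1], objects 0 and 1 always co-occur while objects 0 and 2 co-occur in half of the clusterings, so the first row of S is [1, 1, 0.5, 0] and the corresponding association distribution is [0.4, 0.4, 0.2, 0].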



Suppose that the data objects are partitioned into $k^*$ disjoint clusters
$\{c_j\}_{j=1}^{k^*}$; therefore, $p(x)$ can be written as follows:

$$p(x) = \sum_{j=1}^{k^*} P(c_j) p(x|c_j)$$

where $P(c_j)$ are the probabilities of the clusters and can be estimated by $n_j/n$,
such that $n_j$ is the number of data objects in cluster $c_j$. The cluster conditional
probability distribution $p(x|c_j)$ of X is estimated by

$$p(x|c_j) = \frac{1}{n_j} \sum_{i=1}^{n_j} p(x|x_{i_j})$$

where $x_{i_j}$ is the i-th object in cluster $c_j$. That is, $p(x|c_j)$ is the average
of the $n_j$ association distributions $p(x|x_{i_j})$.

But these clusters $\{c_j\}_{j=1}^{k^*}$ and their cluster conditional distributions
$p(x|c_j)$ are unknown and represent exactly what we need to find. Therefore, by
determining how to group the set of n association distributions, we can compute
by aggregation through averaging an empirical cluster conditional probability
distribution of X for each cluster. Section 3 presents an information-theoretic
consensus clustering process developed for grouping of association distributions.

After computing $p(x|c_j)$ for all $j \in \{1, \cdots, k^*\}$, a final object assignment
(or re-arrangement) is performed where each object $x_i$ is assigned to its most likely
cluster $c_j$, which satisfies $P(X = x_i|c_j) > P(X = x_i|c_l)$ for $l \neq j$ and
$l \in \{1, \cdots, k^*\}$.

Hence, each object is assigned to a particular cluster. Furthermore,
estimated assignment probabilities of objects to a cluster $c_j$, $P(c_j|x)$, can be
computed using Bayes rule as follows:

$$P(c_j|x) = \frac{p(x|c_j) P(c_j)}{p(x)} = \frac{p(x|c_j) P(c_j)}{\sum_{l=1}^{k^*} p(x|c_l) P(c_l)}$$

Notice that the cluster conditional probability distribution introduced here is 
a discrete probability function defined on the finite space of object labels, rather 
than the feature vectors as conventionally assumed in supervised classification. 
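
The averaging and assignment steps of Section 2.2 can be sketched as follows, assuming that a grouping of the association distributions is already available. The names `cluster_conditionals` and `assign_objects` are hypothetical, and the priors are taken as the group sizes divided by n, which is one reading of the estimate n_j/n.

```python
def cluster_conditionals(P, groups, k):
    """Average the association distributions in each group to estimate p(x | c_j)."""
    n = len(P)
    cond = [[0.0] * n for _ in range(k)]
    sizes = [0] * k
    for i, g in enumerate(groups):
        sizes[g] += 1
        for j in range(n):
            cond[g][j] += P[i][j]
    for g in range(k):
        if sizes[g] > 0:
            cond[g] = [v / sizes[g] for v in cond[g]]
    priors = [s / n for s in sizes]   # P(c_j) estimated by n_j / n
    return cond, priors


def assign_objects(cond, priors):
    """Assign each object to its most likely cluster and compute P(c_j | x_i) by Bayes rule."""
    k = len(cond)
    n = len(cond[0])
    labels, probs = [], []
    for i in range(n):
        best = max(range(k), key=lambda j: cond[j][i])
        evidence = sum(cond[l][i] * priors[l] for l in range(k))
        labels.append(best)
        probs.append(cond[best][i] * priors[best] / evidence if evidence > 0 else 0.0)
    return labels, probs
```

When the groups are perfectly separated, each object has zero probability under every other cluster and its assignment probability is 1; overlapping groups yield values below 1, which is how the model expresses uncertainty about an assignment.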

3 The Clustering Combination Process 

In this section we present a heuristic information-theoretic method for grouping 
association distributions into k* clusters, which as discussed in Section 2.2 is 
the basis for empirically estimating a cluster conditional probability distribution 
for each of the combined clusters. We call this method the Jensen-Shannon 
Divergence based Clustering Combination (JSDCC) as it mainly utilizes the 
Jensen-Shannon divergence. 

We will use the shorthand notation $p_i(x)$ and $p_i(x_j)$ to refer to $p(x|x_i)$ and
$P(X = x_j|x_i)$ respectively. For details on the information theoretic measures
used, the reader is referred to [14]. The Shannon entropy [15] $H(p_i)$ measures
the information content of $p_i(x)$ and is given by

$$H(p_i) = -\sum_{j=1}^{n} p_i(x_j) \log p_i(x_j)$$

The Jensen-Shannon (JS) divergence is a measure of distance between the
probability distributions $p_i(x)$ and $p_j(x)$ and is given by

$$JS(p_i, p_j) = H(p_{ij}) - \frac{1}{2} H(p_i) - \frac{1}{2} H(p_j) \qquad (1)$$

where $p_{ij}(x)$ is the average of the probability distributions $p_i(x)$ and $p_j(x)$.

The JS divergence is symmetric, bounded [16], and
$JS(p_i, p_j) = 0 \iff p_i(x) = p_j(x), \forall x$. Furthermore, it can be generalized
to measure the divergence between any finite number r of probability distributions, and
allows for weighted averages among distributions. The most general form of the JS
divergence is given in Equation 2, where $\pi = \{\pi_1, \pi_2, \cdots, \pi_r\}$ is the
set of normalized weights. In this paper equal weights are used.

$$JS_\pi(\{p_i(x) \mid 1 \leq i \leq r\}) = H\left(\sum_{i=1}^{r} \pi_i p_i(x)\right) - \sum_{i=1}^{r} \pi_i H(p_i) \qquad (2)$$
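
The entropy and the generalized JS divergence are straightforward to compute for discrete distributions. The sketch below assumes base-2 logarithms (the text does not fix the base) and uses the convention 0 log 0 = 0; the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution, with 0 * log(0) taken as 0."""
    return -sum(v * math.log2(v) for v in p if v > 0)


def js_divergence(dists, weights=None):
    """Generalized JS divergence: H(sum_i w_i p_i) - sum_i w_i H(p_i)."""
    r = len(dists)
    if weights is None:
        weights = [1.0 / r] * r   # equal weights, as used in the paper
    size = len(dists[0])
    mixture = [sum(w * d[x] for w, d in zip(weights, dists)) for x in range(size)]
    return entropy(mixture) - sum(w * entropy(d) for w, d in zip(weights, dists))
```

With two distributions and equal weights this reduces to the pairwise form: it is 0 for identical distributions and, in base 2, reaches 1 for two distributions with disjoint supports.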

The method starts by selecting $k^*$ prototypes $T = \{t_1, \cdots, t_{k^*}\}$ from the
set of association distributions $\{p(x|x_i)\}_{i=1}^{n}$, which will be used as initial
representatives of the $k^*$ clusters. From an information theory point of view, we
would like to select the most informative prototypes, yet they should be as divergent
as possible from each other. Hence, information content measured using the
entropy is used to rank the distributions. But we cannot simply select the $k^*$
prototypes with the largest entropies, because although they can indeed contain a
great deal of information, it may be the same information. So, we also need to
choose them so that they are not redundant. Therefore we want to select those
prototypes that have maximum divergence among themselves, where divergence
is measured using the generalized Jensen-Shannon divergence given in Equation
2. In principle, this involves a search over different prototype sets, which is
an enormous search for large n. Therefore, we use an incremental method which
assesses prototypes individually based on their entropies and their divergences
from previously selected prototypes, as described in Algorithm 1.



Algorithm 1 Selection of Prototypes
1: $t_1 \leftarrow \arg\max_{p_i(x)} H(p_i)$
2: for all $j \in \{2, \cdots, k^*\}$ do
3:   for all $i \in \{1, \cdots, n\}$ do
4:     compute divergence $D(p_i) \leftarrow JS_\pi(\{p_i, t_1, \cdots, t_{j-1}\})$ as given in Equation 2, such that $\pi = 1/j$
5:   end for
6:   select $t_j \leftarrow \arg\max_{p_i(x)} H(\arg\max_{p_i(x)} D(p_i))$
7: end for
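
One possible reading of Algorithm 1 is sketched below. Step 6 is interpreted as "among the candidates with maximal divergence from the prototypes chosen so far, take the one with the highest entropy"; this tie-breaking rule, the tolerance `tol`, the exclusion of already selected distributions, and the base-2 logarithm are assumptions of this sketch rather than details stated in the text.

```python
import math


def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)


def js_divergence(dists):
    """Generalized JS divergence with equal weights 1/r over the r given distributions."""
    r = len(dists)
    mixture = [sum(d[x] for d in dists) / r for x in range(len(dists[0]))]
    return entropy(mixture) - sum(entropy(d) for d in dists) / r


def select_prototypes(P, k_star, tol=1e-9):
    """Return the indices of k_star prototype distributions chosen from the rows of P."""
    ents = [entropy(p) for p in P]
    # Step 1: the first prototype is the distribution with maximum entropy.
    chosen = [max(range(len(P)), key=lambda i: ents[i])]
    for _ in range(1, k_star):
        remaining = [i for i in range(len(P)) if i not in chosen]
        # Steps 3-5: divergence of each candidate together with the current prototypes.
        div = {i: js_divergence([P[i]] + [P[c] for c in chosen]) for i in remaining}
        best = max(div.values())
        # Step 6 (as interpreted here): among near-maximal candidates, prefer high entropy.
        tied = [i for i in remaining if best - div[i] <= tol]
        chosen.append(max(tied, key=lambda i: ents[i]))
    return chosen
```

The incremental search evaluates n candidates for each of the k* prototypes, which avoids the exhaustive search over prototype sets mentioned in the text.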



Following the selection of k* cluster prototypes, a distribution merging proce- 
dure is used to merge each of the n association distributions with the prototype 
with minimum JS divergence. The procedure is summarized in Algorithm 2. 

4 Experimental Analysis 

Experiments are performed using the K-means clustering algorithm. The numbers
of clusterings generated, m, are 10 and 50. We use a fixed value $k_i = k$ for all
the m clusterings within an ensemble. Different values of k are used, starting
from the true number of clusters in a dataset and gradually increasing k. The
final number of combined clusters $k^*$ represents the number of true clusters and
is assumed known. For each combination of k and m, we evaluate the quality of the
combined clusterings generated using SLink, JSDCC, CSPA and WSnnG using
the commonly used F-measure. We also show the average ensemble's F-measure.

Algorithm 2 Merging of distributions
1: for all $i \in \{1, \cdots, n\}$ do
2:   for all $j \in \{1, \cdots, k^*\}$ do
3:     compute $JS(p_i, t_j)$ as given in Equation 1
4:   end for
5:   let $t \leftarrow \arg\min_{t_j \in T} JS(p_i(x), t_j)$ and merge $p_i(x)$ with t
6: end for
7: Average the merged $k^*$ clusters of distributions to estimate $\{p(x|c_j)\}_{j=1}^{k^*}$
8: Assign objects to their most likely clusters and compute assignment probabilities
   (See Section 2.2)
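
Steps 1 to 7 of Algorithm 2 can be sketched as below: each association distribution is attached to the prototype with the smallest pairwise JS divergence, and the members of each group are then averaged. The function names are hypothetical, base-2 logarithms are assumed, and ties are resolved in favour of the lowest prototype index, which the text does not specify.

```python
import math


def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)


def js_pair(p, q):
    """Pairwise JS divergence (Equation 1) with equal weights."""
    mix = [(a + b) / 2.0 for a, b in zip(p, q)]
    return entropy(mix) - 0.5 * entropy(p) - 0.5 * entropy(q)


def merge_distributions(P, prototypes):
    """Group each row of P with its nearest prototype, then average each group."""
    k = len(prototypes)
    n = len(P[0])
    # Steps 1-6: index of the prototype with minimum JS divergence for every distribution.
    groups = [min(range(k), key=lambda j: js_pair(p, prototypes[j])) for p in P]
    # Step 7: average the distributions merged with each prototype.
    cond = []
    for j in range(k):
        members = [P[i] for i, g in enumerate(groups) if g == j]
        if members:
            cond.append([sum(col) / len(members) for col in zip(*members)])
        else:
            cond.append([0.0] * n)   # empty group: no distribution was merged with it
    return groups, cond
```

The returned averages play the role of the empirical cluster conditional distributions, from which the final assignment and the assignment probabilities of step 8 follow as described in Section 2.2.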




Fig. 1. Data sets: (a) the 2D2K dataset consists of 2 Gaussian clusters, not linearly
separable; (b) the 2D2C-Non-Spherical dataset has 2 unbalanced non-spherical clusters;
and (c) the 2D3C-Strings dataset has 3 string-like clusters of the same size



We use five datasets to evaluate the performance of the different methods in
a number of different situations. The first is the 2D2K dataset used in [3] and
downloaded from http://www.strehl.com/. We use a random sample of 300 data
points. The dataset is shown in Figure 1 (a) and consists of two Gaussian clusters
which are not linearly separable. The second dataset is 2D2C-Non-Spherical,
shown in Figure 1 (b), which is artificially generated and consists of 2 ellipsoidal
clusters with different sizes (200, 50) and covariances. The third dataset is
2D3C-Strings, shown in Figure 1 (c), which consists of three ellipsoidal and elongated
clusters of equal sizes (100, 100, 100) and different orientations. The fourth is the
Iris dataset, available from the UCI machine learning repository, which consists of
4-dimensional points in 3 clusters, one linearly separable and 2 interleaving. The
fifth is the Wisconsin diagnostic breast cancer data (WDBC), also available from
the UCI repository, which consists of 569 instances and 30 numeric attributes, with a
class distribution of 357 benign and 212 malignant.

Notice that the ensemble approach in [5] used varying ki. In [3,8,9] various
clustering techniques were used to generate the ensembles, and in [9], the vote
threshold and the number of nearest neighbors were varied. Here, the focus is on
evaluating their respective underlying combination methods on the ensembles

Table 1. F-Measure for the 2D2K data set

             Ensemble's         Consensus Functions
m     k      Mean        k*     SLink    JSDCC    CSPA     WSnnG
10    2      0.986       2      0.986    0.986    0.963    0.963
10    4      0.731       2      0.627    0.986    0.980    0.976
10    6      0.579       2      0.666    0.983    0.973    0.973
10    8      0.535       2      0.666    0.979    0.973    0.976
10    10     0.497       2      0.976    0.983    0.960    0.966
50    2      0.986       2      0.986    0.986    0.963    0.963
50    4      0.725       2      0.704    0.983    0.980    0.980
50    6      0.582       2      0.665    0.979    0.980    0.980
50    8      0.533       2      0.665    0.983    0.976    0.976
50    10     0.480       2      0.665    0.979    0.976    0.976
Average                         0.7606   0.9827   0.9724   0.9729


Table 2. F-Measure for the 2D2C-Non-Spherical

             Ensemble's         Consensus Functions
m     k      Mean        k*     SLink    JSDCC    CSPA     WSnnG
10    2      0.743       2      0.769    0.769    0.729    0.725
10    4      0.643       2      0.984    0.976    0.718    0.722
10    6      0.546       2      0.984    0.984    0.729    0.722
10    8      0.471       2      0.725    0.924    0.722    0.725
10    10     0.414       2      0.729    0.608    0.722    0.725
50    2      0.716       2      0.769    0.769    0.729    0.733
50    4      0.637       2      0.984    0.972    0.729    0.729
50    6      0.544       2      0.984    0.984    0.722    0.726
50    8      0.474       2      0.984    0.980    0.733    0.729
50    10     0.415       2      0.984    0.984    0.726    0.729
Average                         0.8896   0.8950   0.7259   0.7265


specified by the parameters m and k as shown in the first two columns of the 
results Tables 1, 2, 3, 4, and 5. 

From the results, it is noted that in the cases of 2D2K, 2D3C-Strings, Iris, and
WDBC there is a decline in the F-measure with the SLink approach, which is believed
to be due to its inherent chaining effect, particularly
when the clusters are not linearly separable. CSPA and WSnnG did not
adapt successfully to the cluster structure of 2D2C-Non-Spherical, whereas
SLink and JSDCC performed well in most ensembles, by uncovering the
cluster structure which in this case K-means (see the average at k = 2) has
failed to find. The 2D3C-Strings dataset was the hardest for K-means (see the average
at k = 3), whereas at some combinations of m and k the performance of JSDCC,
CSPA and WSnnG was relatively good in discovering the cluster structure, yet
a dependency on k is clearly observed. In the case of Iris, JSDCC, CSPA
and WSnnG outperformed SLink, and in the case of WDBC, JSDCC was best.

Table 3. F-Measure for the 2D3C-Strings Dataset

             Ensemble's         Consensus Functions
m     k      Mean        k*     SLink    JSDCC    CSPA     WSnnG
10    3      0.652       3      0.663    0.690    0.7      0.695
10    5      0.631       3      0.752    0.700    0.686    0.722
10    7      0.644       3      0.664    0.669    0.729    0.669
10    9      0.561       3      0.590    0.926    0.959    0.873
10    11     0.513       3      0.576    0.811    0.929    0.896
50    3      0.655       3      0.690    0.690    0.659    0.635
50    5      0.628       3      0.725    0.690    0.635    0.722
50    7      0.627       3      0.751    0.751    0.641    0.736
50    9      0.568       3      0.699    0.924    0.926    0.809
50    11     0.511       3      0.699    0.618    0.953    0.923
Average                         0.6809   0.7469   0.7817   0.7680


Table 4. F-Measure for the Iris dataset

     Ensemble's             Consensus Functions
m      k     Mean     k*    SLink    JSDCC    CSPA     WSnnG
10     3     0.872    3     0.891    0.891    0.853    0.839
10     5     0.759    3     0.758    0.831    0.966    0.946
10     7     0.708    3     0.758    0.890    0.96     0.919
10     9     0.637    3     0.758    0.831    0.78     0.826
10     11    0.590    3     0.758    0.831    0.80     0.753
10     13    0.520    3     0.771    0.973    0.973    0.886
50     3     0.846    3     0.891    0.891    0.853    0.839
50     5     0.755    3     0.831    0.897    0.919    0.893
50     7     0.684    3     0.758    0.831    0.973    0.946
50     9     0.622    3     0.758    0.966    0.979    0.693
50     11    0.573    3     0.758    0.897    0.979    0.973
50     13    0.525    3     0.758    0.973    0.973    0.686
Average                     0.7873   0.8918   0.9173   0.8499



Table 5. F-Measure for the WDBC Data

     Ensemble's             Consensus Functions
m      k     Mean     k*    SLink    JSDCC    CSPA     WSnnG
10     2     0.844    2     0.844    0.844    0.675    0.635
10     3     0.799    2     0.890    0.890    0.818    0.802
10     4     0.764    2     0.682    0.779    0.839    0.792
10     5     0.679    2     0.683    0.855    0.694    0.692
10     6     0.594    2     0.683    0.877    0.840    0.844
50     2     0.844    2     0.844    0.844    0.675    0.635
50     3     0.799    2     0.743    0.890    0.815    0.801
50     4     0.764    2     0.682    0.821    0.797    0.792
50     5     0.678    2     0.683    0.855    0.700    0.692
50     6     0.588    2     0.683    0.854    0.844    0.675
Average                     0.7417   0.8509   0.7697   0.7360

The average performance over all the ensemble configurations shown for each
dataset is as follows. In the case of the 2D2K dataset, JSDCC improves over the
lowest performing combination method by 29%. In the case of the
2D2C-NonSpherical dataset, JSDCC was best and improves over the lowest by 23%
and by 20% over the K-means at the true number of clusters (k=2). In the case of
2D3C-Strings, it is lower by 5% than the highest performing method and improves
over the lowest by 8% and by 14% over the K-means at the true number of clusters
(k=3). In the case of the Iris data, it is 2% lower than the highest and
improves by 14% over the lowest. Finally, in the case of WDBC, it is the best,
and improves over the lowest combination method by 16%.

5 Discussion and Conclusion 

The co-association matrix represents a voting medium allowing the generation
of consensus clustering. In this paper, we proposed a probabilistic model based
on the co-association matrix in which the definition of association distributions
allowed for the generation of empirical cluster-conditional association
distributions for each of the combined clusters. The model allowed the evaluation
of the probabilities with which objects belong to each cluster. The key
objectives of developing this model are to represent the associations of
neighboring data objects and to allow for the manipulation of their association
distributions using probabilistic and information-theoretic tools. Notably, the
model expresses the same idea as shared nearest neighbors [17, 8, 9], since the
association distribution is indeed a function of each object's co-associated
neighbors. A Jensen-Shannon divergence based Clustering Combination (JSDCC)
method was developed for grouping association distributions.
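
As a rough illustration of the quantity JSDCC relies on, the sketch below computes the Jensen-Shannon divergence between two association distributions. It makes two assumptions that are not stated in this section: an object's association distribution is taken to be its row of the co-association matrix normalized to sum to one, and the two distributions are weighted equally. The function names and the small matrix are hypothetical.

```python
import math

def association_distribution(row):
    """Normalize one row of co-association counts into a probability distribution.

    Assumption: the association distribution is the normalized row.
    """
    total = sum(row)
    if total == 0:
        raise ValueError("row has no associations")
    return [v / total for v in row]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Equal-weight Jensen-Shannon divergence; one pass over n entries, hence O(n)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical co-association counts for four objects (two obvious groups).
co_association = [
    [4, 3, 0, 0],
    [3, 4, 1, 0],
    [0, 1, 4, 4],
    [0, 0, 4, 4],
]
dists = [association_distribution(row) for row in co_association]
d_far = js_divergence(dists[0], dists[3])   # disjoint supports
d_near = js_divergence(dists[0], dists[1])  # overlapping supports
```

With base-2 logarithms the divergence lies in [0, 1]; objects whose co-associated neighbors do not overlap at all reach the upper bound, which is why it can serve as a distance between objects.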

The JSDCC method has a quadratic computational complexity (O(n^2)),
since the computation of the JS divergence is O(n). Future work will focus on
optimizing the approach for scalability to large datasets. For instance, we can
cut down the number of distance computations by processing all the objects that
have high Sy values with selected prototypes. That is, we can use both the Sy
values and divergences jointly, leaving divergence computation for less obvious
cases. Although the space complexity of the co-association matrix itself is
O(n^2), it is possible in future work to limit the number of co-associated
objects to the K-nearest neighbors, which will reduce the space requirement to
O(Kn).

Abstract. This paper describes a method for fusing a collection of classifiers
where the fusion can compensate for some positive correlation among the
classifiers. Specifically, it does not require the assumption of evidential
independence of the classifiers to be fused (as Dempster-Shafer's fusion rule
does). The proposed method is associative, which allows fusing three or more
classifiers irrespective of the order. The fusion is accomplished using a
generalized intersection operator (T-norm) that better represents the possible
correlation between the classifiers. In addition, a confidence measure is
produced that takes advantage of the consensus and conflict between classifiers.



1 Introduction 

Design of a successful classifier fusion system consists of two important parts:
design and selection of a set of individual classifiers [9, 17], and design of
the classifier fusion mechanism [14]. Key to effective classifier fusion is the
diversity of the individual classifiers. Strategies for boosting diversity
include: 1) using different types of classifiers; 2) training individual
classifiers with different data sets (bagging and boosting); and 3) using
different subsets of features.

In the literature we can find many fusion methods derived from different
underlying frameworks, ranging from Bayesian probability theory [18], to fuzzy
sets [5], Dempster-Shafer evidence theory [1], and group decision-making theory
(majority voting [10], weighted majority voting, and Borda count [7]). Existing
fusion methods typically do not address correlation among classifiers' output
errors. For example, methods based on Dempster-Shafer theory assume evidential
independence [16, p. 147].

In real-world applications, however, most individual classifiers exhibit
correlated errors due to common data sources, domain knowledge based filtering,
etc. Typically, they all tend to make the same classification errors on the most
difficult patterns. Prior research has focused on relating the correlation in
the classifiers' errors to the ensemble's error [17], quantifying the degree of
the correlation [8, 13], and minimizing the correlation degree between the
classifiers [11]. However, not much attention has been devoted to investigating
fusion mechanisms that can compensate for the impact of the classifiers' error
correlations.

This paper investigates the impact of parameter changes within a T-norm based
framework for a set of n classifiers. Rather than using pairwise correlations of
classifiers' errors as a method to select a subset of classifiers, we focus on
the overall behavior of the fusion. We use a single parameter in an attempt to
provide a uniform compensation for the correlation of the classifiers' errors.
We explore the impact of such compensation on the final classification by using
four performance metrics. We empirically obtain the value of the parameter that
gives us the best solution in this performance space.



2 Fusion via Triangular Norms 

We propose a general method for the fusion process, which can be used with
classifiers that may exhibit any kind of (positive, neutral, or negative)
correlation with each other. Our method is based on the concept of Triangular
Norms, a multi-valued logic generalization of the Boolean intersection operator.
To justify the use of the intersection operator it can be argued that the fusion
of multiple decisions, produced by multiple sources, regarding objects (classes)
defined in a common framework (universe of discourse) consists in determining
the underlying degree of consensus for each object (class) under consideration,
i.e., the intersections of their decisions. With the intersections of multiple
decisions one needs to account for possible correlation among the sources, to
avoid under- or over-estimates. Here we explicitly account for this by the
proper selection of a T-norm operator.

We combine the outputs of the classifiers by selecting the generalized
intersection operator (T-norm) that better represents the possible correlation
between the classifiers. With this operator we will intersect the assignments of
the classifiers and compute a derived measure of consensus. Under certain
conditions, defined in section 2.2, we will be able to perform this fusion in an
associative manner, e.g., we will combine the output of the fusion of the first
two classifiers with the output of the third classifier, and so on, until we
have used all available classifiers. At this stage, we can normalize the final
output (showing the degree of selection as a percentage), identify the strongest
selection of the fusion, and qualify it with its associated degree of
confidence.



2.1 Triangular Norms 

Triangular norms (T-norms) and Triangular conorms (T-conorms) are the most
general families of binary functions that satisfy the requirements of the
conjunction and disjunction operators, respectively. T-norms T(x,y) and
T-conorms S(x,y) are two-place functions that map the unit square into the unit
interval, i.e., $T(x,y): [0,1] \times [0,1] \to [0,1]$ and
$S(x,y): [0,1] \times [0,1] \to [0,1]$. They are monotonic, commutative and
associative functions. Their corresponding boundary conditions, i.e., the
evaluation of the T-norms and T-conorms at the extremes of the [0,1] interval,
satisfy the truth tables of the logical AND and OR operators. They are related
by the DeMorgan duality, which states that if N(x) is a negation operator, then
the T-conorm S(x,y) can be defined as S(x,y) = N(T(N(x), N(y))).

In Bonissone and Decker [4], six parameterized families of T-norms and their
dual T-conorms were discussed and analyzed. Of the six parameterized families,
one family was selected due to its complete coverage of the T-norm space and its
numerical stability. This family, originally defined in [15], has a parameter p
that spans the space of T-norms. By selecting different values of p we can
instantiate T-norms with different properties. See [3, 15] for more detailed
information.
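
The DeMorgan duality stated above can be sketched in a few lines. The example assumes the standard negation N(x) = 1 - x, which the text does not fix, and uses three familiar T-norms (minimum, product, and max(0, x + y - 1)); all function names are illustrative.

```python
def negation(x):
    """Standard negation N(x) = 1 - x (an assumption; the text only requires some N)."""
    return 1.0 - x

def dual_conorm(t_norm):
    """Build the T-conorm S(x, y) = N(T(N(x), N(y))) from a given T-norm."""
    def s(x, y):
        return negation(t_norm(negation(x), negation(y)))
    return s

def t_product(x, y):
    return x * y

def t_lukasiewicz(x, y):
    return max(0.0, x + y - 1.0)

s_max = dual_conorm(min)                # dual of min is max
s_prob = dual_conorm(t_product)         # x + y - x*y
s_bounded = dual_conorm(t_lukasiewicz)  # min(1, x + y)

values = (s_max(0.3, 0.6), s_prob(0.3, 0.6), s_bounded(0.3, 0.6))
```

At the corners of the unit square each T-norm reproduces the truth table of AND and each derived T-conorm that of OR, which is the boundary condition mentioned in the text.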

2.2 Determination of a Combined Decision 

Let us define m classifiers $S_1, \ldots, S_m$, such that the output of
classifier $S_j$ is the vector $I^j$ showing the normalized decision of that
classifier over the N classes. In this representation, the last, $(N+1)$-th,
element represents the universe of all classes, U, and is used to indicate the
classifier's lack of commitment, i.e. no-decision:

$$I^j = [I^j(1), I^j(2), \ldots, I^j(N+1)], \quad \text{where } I^j(i) \in [0,1] \text{ subject to: } \sum_{i=1}^{N+1} I^j(i) = 1$$

We define the un-normalized fusion of the outputs of two classifiers $S_1$ and
$S_2$ as:

$$F(I^1, I^2) = \mathrm{Extraction}[\mathrm{Outerproduct}(I^1, I^2, T)] = \mathrm{Extraction}[A] \qquad (1)$$

where the outer-product is a well-defined mathematical operation, which takes as
arguments two (N+1)-dimensional vectors $I^1$ and $I^2$ and generates as output
the (N+1)x(N+1) dimensional array A. Each element A(i,j) is the result of
applying the operator T to the corresponding vector elements, namely $I^1(i)$
and $I^2(j)$, i.e.:

$$A(i,j) = T[I^1(i), I^2(j)] \qquad (2)$$

The Extraction operator recovers the un-normalized output in vector form:

$$\mathrm{Extraction}[A] = [I'(1), I'(2), \ldots, I'(N+1)] \qquad (3)$$

$$\text{where: } I'(i) = A(i,i) + A(i,N+1) + A(N+1,i) \quad \text{for } i = 1, \ldots, N \qquad (4)$$

$$\text{and } I'(N+1) = A(N+1,N+1) \qquad (5)$$

The extraction operator leverages the fact that all classes are disjoint. A
justification for equations (4) and (5) can be found in the case analysis
illustrated in eq. (7).
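
Equations (1) to (5) can be sketched as follows. This is a minimal illustration, not the authors' implementation: vectors are plain Python lists whose last entry holds the weight on the universe U, the product T-norm is used only as an example operator, and the two input vectors are made up.

```python
def outer_product(i1, i2, t_norm):
    """Eq. (2): A(i, j) = T[I1(i), I2(j)] for all pairs of entries."""
    return [[t_norm(a, b) for b in i2] for a in i1]

def extraction(a):
    """Eqs. (3)-(5): collapse the matrix A back into an un-normalized vector."""
    n = len(a) - 1  # index n is the universe U (the (N+1)-th element)
    out = [a[i][i] + a[i][n] + a[n][i] for i in range(n)]  # eq. (4)
    out.append(a[n][n])                                     # eq. (5)
    return out

def fuse(i1, i2, t_norm):
    """Eq. (1): un-normalized fusion of two classifier outputs."""
    return extraction(outer_product(i1, i2, t_norm))

def t_product(x, y):
    return x * y

# Hypothetical outputs for N = 2 classes plus U; each vector sums to one.
i1 = [0.6, 0.2, 0.2]
i2 = [0.5, 0.3, 0.2]
fused = fuse(i1, i2, t_product)
```

Off-diagonal cells that pair two different classes, such as A(1,2), are not collected by eq. (4); they correspond to the empty intersection of disjoint classes and represent conflict between the classifiers.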

There is an infinite number of T-norms. Therefore, to cover their space we used 
the family proposed by Schweizer and Sklar [15], in which each T-norm is uniquely 
defined by a real valued parameter p. We selected the six most representative ones 
for some practical values of information granularity, as listed in Table 1. 



Table 1. Example of Outer Product using as operator the function T(x,y).

T-norm                                   Value of p   Correlation Type
T1(x,y) = max(0, x + y - 1)              p = -1       Extreme case of negative correlation
T2(x,y) = max(0, x^0.5 + y^0.5 - 1)^2    p = -0.5     Partial case of negative correlation
T3(x,y) = x*y                            p → 0        No correlation
T4(x,y) = (x^-0.5 + y^-0.5 - 1)^-2       p = 0.5      Mildly positive correlation
T5(x,y) = (x^-1 + y^-1 - 1)^-1           p = 1        Partial case of positive correlation
T6(x,y) = min(x, y)                      p → ∞        Extreme case of positive correlation

The selection of the best T-norm to be used as the intersection operation in the
fusion of the classifiers depends on the potential correlation among their
errors. T6 (the minimum operator) should be used when one classifier subsumes
the other one (extreme case of positive correlation). T3 should be selected when
the classifiers' errors are uncorrelated (similar to the evidential independence
in Dempster-Shafer). T1 should be used if the classifiers are mutually exclusive
(extreme case of negative correlation). The operators T2, T4, and T5 should be
selected when the classifiers' errors show intermediate stages of negative (T2)
or positive correlation (T4 and T5). Of course, other T-norms could also be
used. These T-norms are simply good representatives of the infinite number of
functions that satisfy the T-norm properties.
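
The family listed in Table 1 can be generated from the single parameter p. The sketch below assumes the closed form implied by the table entries, T_p(x, y) = max(0, x^-p + y^-p - 1)^(-1/p), with the product and the minimum treated as the limits p → 0 and p → ∞; the function name and the handling of zero arguments are choices made here for illustration.

```python
import math

def schweizer_sklar(p):
    """Return the T-norm T_p(x, y) for the parameter value p."""
    if p == 0:
        return lambda x, y: x * y           # limit p -> 0
    if math.isinf(p) and p > 0:
        return min                          # limit p -> infinity
    def t(x, y):
        if p > 0:
            if x == 0.0 or y == 0.0:
                return 0.0                  # avoid 0 ** negative; the limit is 0
            return (x ** -p + y ** -p - 1.0) ** (-1.0 / p)
        base = x ** -p + y ** -p - 1.0      # p < 0: clip at zero as in Table 1
        return base ** (-1.0 / p) if base > 0.0 else 0.0
    return t

# Evaluate the six representative T-norms on one pair of inputs.
params = [-1, -0.5, 0, 0.5, 1, math.inf]
values = [schweizer_sklar(p)(0.8, 0.9) for p in params]
```

For a fixed pair of inputs the result grows with p, from max(0, x + y - 1) up to min(x, y), which matches the ordering from extreme negative to extreme positive correlation in the table.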

From the associativity of the T-norms we can derive the associativity of the
fusion

$$F(I^1, F(I^2, I^3)) = F(F(I^1, I^2), I^3) \qquad (6)$$

provided that equation (4) only contains one term. This is the case when: a)
there is no lack of commitment in the classifiers [$A(i,N+1) = A(N+1,i) = 0$ for
$i = 1, \ldots, N$]; b) the lack of commitment is complete in one classifier
[$A(i,i) = 0$ for $i = 1, \ldots, N$ and either $A(i,N+1) = I^1(i)$ &
$A(N+1,i) = 0$ or $A(N+1,i) = I^2(i)$ & $A(i,N+1) = 0$]; c) the lack of
commitment is complete in both classifiers [eq. (4) = 0 & eq. (5) = 1]. In any
other case, the preservation of associativity requires the distributivity of the
T-norm over the addition operator of equation (4). This distributivity is
satisfied only by T-norm $T_3$ (scalar product). For any other T-norm the lack
of associativity means that the outer-product A defined by equation (2) must be
computed on all dimensions at once, using as operator a dimensionally extended
Schweizer & Sklar T-norm:

$$T(x_1, \ldots, x_k) = \begin{cases} \left( \sum_{i=1}^{k} x_i^{-p} - k + 1 \right)^{-1/p} & \text{if } (p > 0) \text{ or } (p < 0) \text{ and } \sum_{i=1}^{k} x_i^{-p} > (k-1) \\ 0 & \text{otherwise} \end{cases}$$

Given our premise that the classes are disjoint, we have four possible
situations:

(a) when $i = j$ and $i < (N+1)$ then $r_i \cap r_j = r_i \cap r_i = r_i$   (7a)

(b) when $i = j$ and $i = (N+1)$ then $r_i \cap r_j = U$ (the universe of classes)   (7b)

(c) when $i \neq j$ and $i < (N+1)$ and $j < (N+1)$ then $r_i \cap r_j = \emptyset$ (the empty set)   (7c)

(d) when $i \neq j$ and either $i = (N+1)$, then $U \cap r_j = r_j$, or $j = (N+1)$, then $r_i \cap U = r_i$   (7d)

Figure 1 illustrates the result of the intersections of the classes and the universe U. 



Fig. 1. Intersection of disjoint classes and the universe U.

Note that when DS theory is applied to subsets containing only singletons or the
entire universe of classes U, its fusion rule becomes a special case of our
proposed methodology.
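
The associativity condition of this section can be checked numerically. The sketch below fuses three made-up classifier outputs, each with some weight on U, in both groupings of equation (6). With the product T-norm the two groupings agree; with the minimum they do not, which illustrates why other T-norms need the dimensionally extended operator. The helper mirrors equations (2), (4) and (5); names and input values are hypothetical, and the intermediate vectors are left un-normalized.

```python
def fuse(i1, i2, t_norm):
    """Un-normalized pairwise fusion: outer product followed by extraction."""
    n = len(i1) - 1  # index n is the universe U
    a = [[t_norm(x, y) for y in i2] for x in i1]
    out = [a[i][i] + a[i][n] + a[n][i] for i in range(n)]
    out.append(a[n][n])
    return out

def t_product(x, y):
    return x * y

# Three hypothetical outputs over two classes plus U (partial lack of commitment).
i1 = [0.6, 0.2, 0.2]
i2 = [0.5, 0.3, 0.2]
i3 = [0.7, 0.1, 0.2]

left_prod = fuse(fuse(i1, i2, t_product), i3, t_product)
right_prod = fuse(i1, fuse(i2, i3, t_product), t_product)

left_min = fuse(fuse(i1, i2, min), i3, min)
right_min = fuse(i1, fuse(i2, i3, min), min)

product_is_associative = all(
    abs(a - b) < 1e-12 for a, b in zip(left_prod, right_prod)
)
min_is_associative = all(
    abs(a - b) < 1e-12 for a, b in zip(left_min, right_min)
)
```

Because the product distributes over the sum in equation (4), chaining pairwise fusions is safe for that operator even when classifiers place weight on U.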

2.3 Measure of Confidence 

In cases where classes are ordered, we can compute the confidence in the fusion
by defining a measure of the scattering around the main diagonal of the
confusion matrix. The more weight is assigned to elements outside the main
diagonal, the lower is the measure of consensus among the classifiers. We can
represent this concept by defining a penalty matrix $P = [P(i,j)]$ of the form:

$$P(i,j) = \begin{cases} \max(0,\, 1 - W \cdot |i-j|)^{d} & \text{for } 1 \le i \le N \text{ and } 1 \le j \le N \\ 1 & \text{for } i = (N+1) \text{ or } j = (N+1) \end{cases} \qquad (8)$$

This function rewards the presence of weights on the main diagonal, indicating
agreement between the two classifiers, and penalizes the presence of elements
off the main diagonal, indicating conflict. The conflict increases in magnitude
as the distance from the main diagonal increases. For example, for W=0.2 and d=5
we have the following penalty matrix:

Fig. 2. Penalty matrix P for W=0.2 and d=5.

Of course any other function penalizing elements off the main diagonal (any
suitable non-linear function of the distance from the main diagonal, i.e. the
absolute value |i-j|) could also be used. The reason for this penalty function
is that conflict is gradual, since the classes have an ordering. Therefore, we
want to capture the fact that the discrepancy between classes $r_1$ and $r_2$ is
smaller than the discrepancy between $r_1$ and $r_3$. The shape of the penalty
matrix P captures this concept, as P shows that the confidence decreases
non-linearly with the distance from the main diagonal.
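
The penalty matrix of equation (8) is easy to generate. The sketch below builds it for the W = 0.2, d = 5 setting mentioned above, assuming N = 5 classes plus the universe U; the entries are computed from the formula rather than copied from the figure, and the function name is illustrative.

```python
def penalty_matrix(n_classes, w, d):
    """Eq. (8): P(i, j) = max(0, 1 - w*|i - j|)**d on the class block, 1 on the U row/column."""
    size = n_classes + 1  # last index is the universe U
    p = [[1.0] * size for _ in range(size)]
    for i in range(n_classes):
        for j in range(n_classes):
            p[i][j] = max(0.0, 1.0 - w * abs(i - j)) ** d
    return p

p = penalty_matrix(5, 0.2, 5)
first_row = p[0]  # penalties at distance 0, 1, 2, 3, 4 from the diagonal, then U

# Letting d grow without bound keeps only exact agreement on the class block.
p_strict = penalty_matrix(5, 0.2, float("inf"))
```

With these settings the weight drops from 1 on the diagonal to roughly 0.33, 0.08, 0.01 and 0.0003 at distances one to four, so disagreement between adjacent classes is penalized far less than disagreement between distant ones.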

A measure of the normalized confidence C is the sum of element-wise products
between $\hat{A}$ and P, i.e.:

$$C = \mathrm{NormalizedConfidence}(\hat{A}, P) = \sum_{i=1}^{N+1} \sum_{j=1}^{N+1} \hat{A}(i,j) \cdot P(i,j)$$

where $\hat{A}$ is the normalized fusion matrix.

We could interpret the confidence factor C as the weighted cardinality of the
normalized assignments around the main diagonal, after all the classifiers have
been fused.

In the special case of Dempster-Shafer, the measure of confidence C is the
complement (to one) of the measure of conflict K, i.e.: C = 1 - K, where K is
the sum of weights assigned to the empty set. We would like to point out that
the fusion rule based on Dempster-Shafer corresponds to the selection of:

(a) T-norm operator T(x,y) = x*y

(b) Penalty function using d = ∞

Constraint b) implies that the penalty matrix P will be:

Fig. 3. Penalty matrix P for the special case of DS confidence computation: d=∞.

Therefore the two additional constraints a) and b) required by Dempster-Shafer 
theory imply that 

(a) The classifiers to be fused must be uncorrelated (evidentially independent) 
and 

(b) There is no ordering over the classes, and any kind of disagreement 
(weights assigned to elements off the main diagonal) can only contribute to 
a measure of conflict and not (at least to a partial degree) to a measure of 
confidence. In DS, the measure of conflict K is the sum of weights assigned 
to the empty set. This corresponds to the elements with a 0 in the penalty 
matrix P illustrated in Figure 3. 

Clearly, our proposed method is more general, since it does not require these addi- 
tional constraints, and includes the DS fusion rule as a special case. 

2.4 Decisions from the Fusion Process 

The decisions for each class can be gathered by adding up all the weights assigned to 
them, as indicated in equations (4) and (5). 

3 Examples 

3.1 Example (1): Agreement between the Classifiers without Discounting 

Let $I^1$ = [0.8, 0.15, 0.05, 0, 0, 0] and $I^2$ = [0.9, 0.05, 0.05, 0, 0, 0]

This indicates that both classifiers are showing a strong preference for the
first class, to which they assign 0.8 and 0.9, respectively. We fuse these
classifiers using each of the T-norm operators defined above and normalize the
results so that the sum of the entries is equal to one. Note that during the
process we will use the un-normalized matrices A to preserve the associative
property. Using the expressions for weights of a class, we can compute the final
weights for the N classes and the universe. Table 2 shows the results of the
fusion of classifiers S1 and S2, using each of the five T-norms, with the
associated normalized confidence measure.



Table 2. Fusion of two classifiers and associated confidence measure with agreement.

T-norm                                   C1      C2      C3      C4   C5   U    Norm. Conf.
T6(x,y) = min(x, y)                      0.615   0.038   0.038   0    0    0    0.774
T5(x,y) = (x^-1 + y^-1 - 1)^-1           0.633   0.034   0.022   0    0    0    0.770
T3(x,y) = x*y                            0.720   0.008   0.003   0    0    0    0.797
T2(x,y) = max(0, x^0.5 + y^0.5 - 1)^2    0.807   0.000   0.000   0    0    0    0.858
T1(x,y) = max(0, x + y - 1)              0.933   0.000   0.000   0    0    0    0.955
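
The entries of Table 2 can be approximately reproduced with a short script. The sketch below rests on two assumptions inferred from the numbers rather than stated explicitly: the fusion matrix A is normalized by the sum of all its entries, and the penalty matrix uses W = 0.2 and d = 5 as in Section 2.3. Function and variable names are illustrative.

```python
def fuse_with_confidence(i1, i2, t_norm, w=0.2, d=5):
    """Return (normalized class weights incl. U, normalized confidence C)."""
    size = len(i1)
    n = size - 1  # index n is the universe U
    a = [[t_norm(x, y) for y in i2] for x in i1]
    total = sum(sum(row) for row in a)  # assumed normalization constant
    weights = [(a[i][i] + a[i][n] + a[n][i]) / total for i in range(n)]
    weights.append(a[n][n] / total)
    conf = 0.0
    for i in range(size):
        for j in range(size):
            if i == n or j == n:
                penalty = 1.0
            else:
                penalty = max(0.0, 1.0 - w * abs(i - j)) ** d
            conf += a[i][j] / total * penalty
    return weights, conf

def t_product(x, y):
    return x * y

def t_lukasiewicz(x, y):
    return max(0.0, x + y - 1.0)

# Classifier outputs from Example (1): five classes plus the universe U.
i1 = [0.8, 0.15, 0.05, 0, 0, 0]
i2 = [0.9, 0.05, 0.05, 0, 0, 0]

w_min, c_min = fuse_with_confidence(i1, i2, min)
w_prod, c_prod = fuse_with_confidence(i1, i2, t_product)
w_luk, c_luk = fuse_with_confidence(i1, i2, t_lukasiewicz)
```

Under these assumptions the class 1 weight and the confidence come out at about 0.615 and 0.774 for the minimum, 0.720 and 0.797 for the product, and 0.933 and 0.955 for max(0, x + y - 1), in line with the corresponding rows of the table.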



3.2 Example (2): Disagreement between the Classifiers, without Ignorance 

In a situation in which there is a discrepancy between the two classifiers, this
fact will be captured by the confidence measure. For instance, let's assume that
the two classifiers are showing strong preferences for two different classes,
e.g.

$I^1$ = [0.15, 0.8, 0.05, 0, 0, 0] and $I^2$ = [0.9, 0.05, 0.05, 0, 0, 0]

The results of their fusion are summarized in Table 3. None of the classes has a
high weight and the normalized confidence has dropped with respect to Table 2.



Table 3. Fusion of two classifiers and related confidence measure with disagreement.

T-norm                                   C1      C2      C3      C4   C5   U    Norm. Conf.
T6(x,y) = min(x, y)                      0.115   0.038   0.038   0    0    0    0.438
T5(x,y) = (x^-1 + y^-1 - 1)^-1           0.127   0.043   0.022   0    0    0    0.438
T3(x,y) = x*y                            0.135   0.040   0.003   0    0    0    0.434
T2(x,y) = max(0, x^0.5 + y^0.5 - 1)^2    0.128   0.016   0.000   0    0    0    0.416
T1(x,y) = max(0, x + y - 1)              0.067   0.000   0.000   0    0    0    0.373



4 Application and Results 

Insurance underwriting is a classification problem. The underwriting (UW)
process consists of assigning a given insurance application, described by its
medical and demographic records, to one of the risk categories (referred to as
rate classes). Traditionally, highly trained human individuals perform insurance
underwriting. The granularization of risk into rate classes is dictated by
industry regulation and by the impossibility for human underwriters to achieve
consistency of decisions over real-valued risk granularity. A given application
for insurance is compared against standards put forward by the insurance company
and classified into one of the risk categories available for the type of
insurance requested by the applicant. The risk categories then affect the
premium paid by the applicant; the higher the risk category, the higher the
premium. Such manual underwriting, however, is not only time-consuming, but also
often inadequate in consistency and reliability. The inadequacy becomes more
apparent as the complexity of insurance applications increases. The automation
of this decision-making process has strong accuracy, coverage and transparency
requirements. There are two main tradeoffs to be considered in designing a
classifier for the insurance underwriting problem: 1) Accuracy versus coverage,
requiring low misclassification rates for a high volume of applications; 2)
Accuracy versus interpretability, requiring a transparent, traceable
decision-making process. Given the highly non-linear boundaries of the rate
classes, continuous approaches such as multiple regressions were not successful
and a discrete classifier was deployed for production.

Once this process is automated by a production engine, we need to perform off-line 
audits to monitor and assure the quality of the production engine decisions. Auditing 
automated insurance underwriting could be accomplished by fusing the outputs of 
multiple classifiers. The purpose of this fusion, then, is to assess the quality of the 
decisions taken by a production engine. In addition, this fusion will identify the best 
cases, which could be used to tune the production engine in future releases, as well as 
controversial or unusual cases that should be highlighted for manual audit by senior 
underwriters, as part of the Quality Assurance process. 

We tested the approach against a case base containing a total of 1,875 cases,
which could be assigned to one of five rate classes. The fusion was performed
using different T-norms. The classifiers to be fused were a classification and
regression tree (MARS), a bank of binary MLP neural nets (NN) using
back-propagation as the learning algorithm, a case-based reasoning engine (CBE),
and a bank of binary support vector machines (SVM). Table 4 shows the
performance of the individual classifiers as expressed by the true positive
rate.



Table 4. Classifier Performance.

Classifier    True Positive Rate
MARS          94.72%
NN            94.08%
CBE           91.52%
SVM           94.51%



4.1 Experiments Results Using Different T-Norms 

Table 5 shows the results of using three T-norms as the outer-product operators
(using the same set of 1,875 non-nicotine users). When using T3 (scalar product)
combined with a strict determination of the conflict (e.g. for d=∞ in the
computation of penalty matrix P), we have an aggregation analogous to that of
Dempster-Shafer's rule of combination. This aggregation relies on the sources
being evidentially uncorrelated, a constraint that is not easy to satisfy in
real situations. In our case, we used a pre-processing stage (Tagging) to
annotate the input data with domain knowledge. Since this pre-processing stage
is applied to a common set of inputs for all models, it is possible to introduce
a common bias; hence the models could exhibit partial positive correlation. We
sampled the space of T-norms corresponding to positive correlation and we found
that T4 is the Pareto-best to account for this positive correlation, according
to the metrics in Table 5. We also compared the T-norm fusion against a baseline
fusion using an averaging scheme. The strict or liberal errors used in Table 5
are extensions of type 1 and type 2 errors in a classification matrix of the
rate classes.



Table 5. Performance of Average and Fusion (using three T-norms).

Metrics                         Average   Fusion using   Fusion using   Fusion using
                                          T3 (p → 0)     T4 (p = 0.5)   T5 (p = 1)
True Positive Rate              94.24%    94.08%         94.72%         94.08%
Incorrect Rate (too strict)     3.47%     3.47%          3.09%          3.47%
Incorrect Rate (too liberal)    2.29%     2.35%          2.19%          2.40%
No Decisions                    0%        0.11%          0%             0.05%
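
The Pareto comparison over the four metrics of Table 5 can be made explicit with a small dominance check. The sketch below copies the table values and assumes that a higher true positive rate is better while lower error and no-decision rates are better; that direction for the no-decision rate is an assumption, and all names are illustrative.

```python
# Metrics from Table 5, in percent:
# (true positive rate, too-strict rate, too-liberal rate, no-decision rate)
RESULTS = {
    "Average": (94.24, 3.47, 2.29, 0.00),
    "T3 (p -> 0)": (94.08, 3.47, 2.35, 0.11),
    "T4 (p = 0.5)": (94.72, 3.09, 2.19, 0.00),
    "T5 (p = 1)": (94.08, 3.47, 2.40, 0.05),
}

def as_gains(metrics):
    """Flip the sign of the rates to be minimized so that larger is always better."""
    tpr, strict, liberal, no_decision = metrics
    return (tpr, -strict, -liberal, -no_decision)

def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    ga, gb = as_gains(a), as_gains(b)
    return all(x >= y for x, y in zip(ga, gb)) and any(x > y for x, y in zip(ga, gb))

def pareto_front(results):
    """Names of the configurations that no other configuration dominates."""
    return [
        name
        for name, metrics in results.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in results.items()
            if other_name != name
        )
    ]

front = pareto_front(RESULTS)
```

On these four configurations the front contains a single entry, the fusion with T4, since it is at least as good as every alternative on each metric.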



5 Conclusions 

We introduced a fusion mechanism based on an outer-product of the outputs, using
a Triangular norm as the operator. The output of the fusion was a class
distribution and a measure of conflict among all the classifiers. We normalized
the final output, identified the strongest selection of the fusion, and
qualified the decision with an associated confidence measure. We considered the
output of each classifier as a weight assignment, representing the
(un-normalized) degree to which a given class was selected by the classifier. As
in DS theory, the assignment of weights to the universe of all rate classes
represented the lack of commitment to a specific decision.

We applied the fusion to the quality assurance (QA) problem for automated
underwriting. When the degree of confidence of a fused decision was below a
lower confidence bound, that case became a candidate for auditing. When the
degree of confidence of a fused decision was above an upper confidence bound,
that case became a candidate for augmenting the Standard Reference Decision set
(SRD). We tested the fusion module with data sets for nicotine and non-nicotine
users. In an analysis of 1,875 non-nicotine applications, we generated a
distribution of the quality of the production engine. By focusing on the two
tails of such distribution, we identified ~9% of the most reliable decisions and
4.9% of the least reliable ones.
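
The two-sided thresholding described above can be sketched as a simple triage over confidence values. The bounds and the confidence scores below are hypothetical, since the text does not report the values used; the function name is also illustrative.

```python
def triage(confidences, lower_bound, upper_bound):
    """Split case indices into audit candidates and reference-set candidates."""
    audit = [i for i, c in enumerate(confidences) if c < lower_bound]
    reference = [i for i, c in enumerate(confidences) if c > upper_bound]
    return audit, reference

# Hypothetical fused-decision confidences for five cases.
confidences = [0.95, 0.40, 0.70, 0.99, 0.30]
audit_cases, reference_cases = triage(confidences, lower_bound=0.45, upper_bound=0.90)
```

Cases that fall between the two bounds are left untouched; only the tails of the confidence distribution trigger an action.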

The proposed fusion module plays a key role in supporting the quality assurance
of the knowledge-based classifier, by selecting the most questionable cases for
auditing. At the same time, the fusion module supports the standard reference
decision lifecycle by identifying the most reliable cases used for updating it
[4]. Our next step will be to use backward search, guided by a diversity measure
[12,14], to select the smallest subset of classifiers to be deployed in the
production version of this QA process.

Abstract. In this paper we adapt the recently proposed Dynamic Integration
ensemble techniques for regression problems and compare their performance to the
base models and to the popular ensemble technique of Stacked Regression. We show
that the Dynamic Integration techniques are as effective for regression as
Stacked Regression when the base models are simple. In addition, we demonstrate
an extension to both Stacked Regression and Dynamic Integration to reduce the
ensemble set in size and assess its effectiveness.



1 Introduction 

The purpose of ensemble learning is to build a learning model which integrates a
number of base learning models, so that the model gives better generalization
performance on application to a particular data-set than any of the individual
base models [3]. Ensemble learning consists of two problems: ensemble generation
(how does one generate appropriate base models?) and ensemble integration (how
does one integrate the base models' predictions to improve performance?).
Ensemble generation can be characterized as being homogeneous if each base
learning model uses the same learning algorithm or heterogeneous if the base
models can be built from a range of learning algorithms. Ensemble integration
can be addressed by one of two mechanisms: either the predictions of the base
models are combined in some fashion during the application phase to give an
ensemble prediction (combination/fusion approach) or the prediction of one base
model is selected according to some criteria to form the final prediction
(selection approach) [9].

Theoretical and empirical work has shown the ensemble approach to be effective
with the proviso that the base models are diverse and sufficiently accurate [3].
These measures are however not necessarily independent of each other. If the
prediction error of all base models is very low, then their learning hypotheses
must be very similar to the true function underlying the data, and hence they
must, of necessity, be similar to each other, i.e. they are unlikely to be
diverse. In essence then there is often a trade-off between diversity and
accuracy [2].

There has been much research work on ensemble learning for regression in the
context of neural networks; however, there has been less research carried out in
terms of using homogeneous ensemble techniques to improve the performance of
simple regression algorithms. In this paper we look at improving the
generalization performance of k-nearest neighbours (k-NN) and least squares
linear regression (LR). These methods were chosen as they are simple models with
different approaches to learning, in that linear regression is an eager model
which tries to approximate the true function by a global linear function and
k-nearest neighbours is a lazy model which tries to approximate the true
function locally.

2 Ensemble Integration 

The initial approaches to ensemble combination for regression were based on the
linear combination of the base models according to the function:

$$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i f_i(x) \qquad (1.1)$$

where $\alpha_i$ is the weight assigned to the base model's prediction $f_i(x)$. The
simplest approach to determining the values of $\alpha_i$ is to set them to the same
value. This is known as the Base Ensemble Method (BEM). More advanced approaches try
to set the weights so as to minimize the mean square error of the training data. Merz
and Pazzani [12] provide an extensive description of these techniques.
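
The linear combination above can be illustrated with a minimal Python sketch. The
function names (combine, bem_weights) and the toy base models are hypothetical, and
the choice of 1/n as the common BEM weight is an assumption (the text only states that
all weights take the same value).

```python
from typing import Callable, List, Sequence

# A base model is any function mapping an instance to a real-valued prediction.
Model = Callable[[Sequence[float]], float]


def combine(models: List[Model], weights: List[float], x: Sequence[float]) -> float:
    """Weighted linear combination of the base models' predictions for x."""
    if len(models) != len(weights):
        raise ValueError("one weight per base model is required")
    return sum(w * m(x) for w, m in zip(weights, models))


def bem_weights(n: int) -> List[float]:
    """Equal weights; 1/n is assumed here so that the result is a simple average."""
    return [1.0 / n] * n


# Toy base models, purely for illustration.
models: List[Model] = [
    lambda x: 2.0 * x[0],
    lambda x: x[0] + 1.0,
    lambda x: 3.0,
]
prediction = combine(models, bem_weights(len(models)), [2.0])
```

With equal weights the combination reduces to the average of the base predictions;
weights fitted to minimize the training mean square error would be passed to combine
in the same way.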

Model selection simply chooses the best base model to make a prediction. This
can be done either in a static fashion using cross-validation majority [15], where the
best model is the one that has the lowest training error, or in a dynamic fashion
[4,11,13] where, based on finding “close” instances in the training data to a test
instance, a base model is chosen which according to certain criteria is believed to
give the best prediction. The rationale for the dynamic approach is that one model may
perform better than other learning models in a localised region of the instance space
even if, on average over the whole instance space, it performs no better than the
others.

An alternative strategy to model integration is to build a meta-model to select or
combine the outputs from base models. The original and most widely used meta-technique
is referred to as Stacking. Stacking was introduced by Wolpert [18] and was
shown theoretically by LeBlanc and Tibshirani [9] to be a bias reducing technique. In
Stacked Regression (SR), the base models produce meta-instances consisting of the
target value and the base models’ predictions, created by running cross validation
over the training data. The meta-data is used to build a meta-model, based on a
regression algorithm, and the base models are built using the whole training data.
The ensemble prediction for a test instance is formed by passing a meta test instance
(formed from the base models’ predictions) to the meta-model. Typically the base
models are either heterogeneous, or homogeneous but built with different
training parameters. Breiman [1] investigated the use of Linear Regression to form
the meta-model and found that Linear Regression is a suitable meta-model so long as
the coefficients of regression are constrained to be non-negative.

More recent meta-approaches for classification are the Dynamic Integration (DI)
techniques developed by Puuronen and Tsymbal [13,16]. Similar to Stacking, these
build a cross-validation history during the training phase. However, the meta-instances
consist of the training instance attribute values and the error of each
model in predicting its target value. During the application phase, a lazy meta-model
based on weighted nearest neighbours uses the meta-data to either dynamically select or
combine models for a test instance. In the Methodology section
we describe in detail the DI techniques and the modifications required to make them
applicable to regression. In this paper, we compare the accuracy of the ensemble
techniques SR and DI over a range of data-sets. It is particularly apposite to compare
SR to the variants of DI as there are strong similarities in their approach, in that
they accumulate meta-data based on a cross-validation history which is then used to
build a meta-model.
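
The cross-validation history shared by SR and DI can be sketched as follows. This is
a minimal illustration under stated assumptions rather than the authors'
implementation: the helper names, the interleaved fold assignment, the two toy learners
and the use of absolute error for the DI meta-instances are all hypothetical choices.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[Sequence[float]], float]
Learner = Callable[[List[Sequence[float]], List[float]], Predictor]


def kfold_indices(n: int, folds: int) -> List[List[int]]:
    """Interleaved fold assignment (an assumption; any partition would do)."""
    return [list(range(f, n, folds)) for f in range(folds)]


def build_meta_instances(X, y, learners: List[Learner], folds: int = 3):
    """Return (sr_meta, di_meta) built from held-out predictions.

    sr_meta[i] = (base predictions for instance i, target value)   -> Stacked Regression
    di_meta[i] = (attribute values of instance i, per-model errors) -> Dynamic Integration
    """
    sr_meta: List[Tuple[List[float], float]] = [([], 0.0)] * len(X)
    di_meta: List[Tuple[List[float], List[float]]] = [([], [])] * len(X)
    for test_idx in kfold_indices(len(X), folds):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        models = [fit(train_X, train_y) for fit in learners]
        for i in test_idx:
            preds = [m(X[i]) for m in models]
            sr_meta[i] = (preds, y[i])
            di_meta[i] = (list(X[i]), [abs(p - y[i]) for p in preds])
    return sr_meta, di_meta


def fit_mean(X, y) -> Predictor:
    """Toy learner: always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda x: mean


def fit_1nn(X, y) -> Predictor:
    """Toy learner: 1-nearest-neighbour with squared Euclidean distance."""
    def predict(x):
        best = min(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
        return y[best]
    return predict


X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
sr_meta, di_meta = build_meta_instances(X, y, [fit_mean, fit_1nn], folds=3)
```

The SR meta-model would then be trained on sr_meta, whereas a DI technique would
search di_meta for the nearest neighbours of a test instance.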



3 Methodology 

In this section we describe the DI classification algorithms and their regression
variants. DI consists of three techniques: Dynamic Selection, Dynamic Voting and
Dynamic Voting with Selection. We refer to their regression counterparts as Dynamic
Selection (DS), Dynamic Weighting (DW) and Dynamic Weighting with Selection (DWS).
Dynamic Selection makes a localized selection of a model based on which model has the
lowest cumulative error for the nearest neighbours (NN) of the test instance. The
procedure for regression remains the same. Dynamic Voting assigns a weight to each base
model based on its localized performance on the NN set, and the final classification is
based on weighted voting. Dynamic Weighting is similar to Dynamic Voting in its
calculation of weights, but the final prediction is made by summing each of the base
models' predictions weighted by a normalized weight value. Dynamic Weighting with
Selection is a regression derivative of Dynamic Voting with Selection. The process is
similar to Dynamic Weighting except that base models with cumulative error in the
upper half of the error interval, $(E^{max} - E^{min})/2$ (where $E^{max}$ is the
largest cumulative error of any model and $E^{min}$ is the lowest cumulative error of
any model), are discarded from adding to the prediction.
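
The three regression variants can be sketched in Python as below. This is a hedged
illustration: the text does not give the exact weight formula, so inverse cumulative
error is assumed here, and "upper half of the error interval" is interpreted as lying
above the midpoint between the lowest and largest cumulative errors. All function names
are hypothetical.

```python
from typing import List, Optional


def cumulative_errors(neighbour_errors: List[List[float]]) -> List[float]:
    """neighbour_errors[i][k] is the error of model i on the k-th nearest neighbour."""
    return [sum(errs) for errs in neighbour_errors]


def dynamic_selection(preds: List[float], cum_err: List[float]) -> float:
    """DS: use the prediction of the model with the lowest cumulative local error."""
    best = min(range(len(preds)), key=lambda i: cum_err[i])
    return preds[best]


def dynamic_weighting(preds: List[float], cum_err: List[float],
                      keep: Optional[List[int]] = None) -> float:
    """DW: normalized weighted sum; inverse-error weights are an assumption."""
    idx = list(range(len(preds))) if keep is None else keep
    eps = 1e-12  # guards against division by zero for a model with no local error
    raw = [1.0 / (cum_err[i] + eps) for i in idx]
    total = sum(raw)
    return sum((w / total) * preds[i] for w, i in zip(raw, idx))


def dynamic_weighting_with_selection(preds: List[float], cum_err: List[float]) -> float:
    """DWS: drop models in the upper half of the error interval, then apply DW."""
    e_min, e_max = min(cum_err), max(cum_err)
    midpoint = e_min + (e_max - e_min) / 2.0  # assumed reading of "upper half"
    keep = [i for i in range(len(preds)) if cum_err[i] <= midpoint]
    return dynamic_weighting(preds, cum_err, keep)


preds = [10.0, 12.0, 20.0]                      # base predictions for a test instance
cum_err = cumulative_errors([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
ds = dynamic_selection(preds, cum_err)
dw = dynamic_weighting(preds, cum_err)
dws = dynamic_weighting_with_selection(preds, cum_err)
```

In this toy case the third model has the largest local error, so DWS discards it and
the combined prediction moves closer to the two locally accurate models.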

In [16] the ensemble generation is improved upon by a feature selection method
based on hill climbing. In this paper, we consider a method that tries to reduce the
size of the ensemble set whilst maintaining its accuracy. During the training phase,
we start with a set of N base models and, based on their training accuracy and
diversity as determined by the cross-validation process (intrinsic to both
the SR and the DI techniques), reduce the size of the set down to M base models.
This process of filtering down the number of models is based on the pseudo-code
described in Figure 1 and adds little algorithmic overhead to the techniques. Its goal
is to remove members from the ensemble that are considered too inaccurate to be
effective and then to consider the remaining members based on both their accuracy
and diversity.

E_i(x) = training error of model i for instance x in the training data
E_i    = sum over x in Train of E_i(x) = total training error for model i
E_min  = minimum total training error for any model
accuracy_i = E_min / E_i

for (i = 1 to N)
    if (accuracy_i < accuracy_threshold) then
        discard model i
    endif
endfor

n = the number of models remaining in the ensemble
// models are re-indexed from 1..n
if n > M then
    // determine diversity
    for (i = 1 to n)
        count = 0
        for (j = 1 to n)
            if (i <> j AND correl(E_i, E_j) > 0.6) then
                count = count + 1
            endif
        endfor
        diversity_i = (n - count) / n
        acc+div_i = accuracy_i + diversity_i
    endfor
    // take the top M base models based on those having the highest acc+div measure
endif

Fig. 1. Ensemble size reduction technique. 
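
A runnable Python sketch of the size reduction procedure is given below. It rests on
assumptions that are not confirmed by the figure: accuracy is taken as E_min / E_i,
models whose accuracy falls below the threshold are the ones discarded, and correl is
taken to be the Pearson correlation of the per-instance training errors. The function
names are hypothetical.

```python
from math import sqrt
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def prune_ensemble(errors: List[List[float]], m: int,
                   accuracy_threshold: float = 0.66,
                   correl_threshold: float = 0.6) -> List[int]:
    """errors[i] holds the per-instance training errors of model i.

    Returns the (sorted) indices of the models kept in the reduced ensemble.
    """
    totals = [sum(e) for e in errors]
    e_min = min(totals)
    # Assumed definition: the best model has accuracy 1, worse models less than 1.
    accuracy = [e_min / t if t > 0 else 1.0 for t in totals]
    kept = [i for i in range(len(errors)) if accuracy[i] >= accuracy_threshold]
    if len(kept) <= m:
        return kept
    n = len(kept)
    score = {}
    for i in kept:
        # Count the other models whose error pattern is strongly correlated with i.
        count = sum(1 for j in kept
                    if j != i and pearson(errors[i], errors[j]) > correl_threshold)
        diversity = (n - count) / n
        score[i] = accuracy[i] + diversity
    top = sorted(kept, key=lambda i: score[i], reverse=True)[:m]
    return sorted(top)


errors = [
    [1.0, 2.0, 3.0, 4.0],      # model 0
    [1.0, 2.0, 3.0, 5.0],      # model 1: errors strongly correlated with model 0
    [4.0, 3.0, 2.0, 1.0],      # model 2: same total error, different pattern
    [10.0, 10.0, 10.0, 10.0],  # model 3: too inaccurate, removed by the threshold
]
kept = prune_ensemble(errors, m=2)
```

In the toy data, model 3 is removed by the accuracy threshold, and model 1 loses out
to models 0 and 2 because it is both slightly less accurate and redundant with model 0.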



4 Experimental Setup 

The base models and ensemble techniques were assessed using 10-fold cross validation,
and the mean absolute error (MAE) was recorded for each technique. Fifteen data-sets
were chosen from the WEKA repository [18]. The data-sets were chosen as they
represent real world data and not artificial regression examples. The data-sets were
pre-processed to remove missing values using a mean or modal technique. The two
base models used were 5-NN and Linear Regression. We assessed the improvement
in accuracy or otherwise of the ensemble in comparison to the base model by using a
two-tailed paired t-test (p=0.05). For each technique, the ensemble set was generated
using the Random Sub-space Method (RSM), first proposed by Ho [5,6] for classification
problems, which is a derivative of Stochastic Discrimination [7]. RSM is a generation
method where each base model is built from training data which has been transformed to
contain a different random subset of the variables. We chose the model tree technique
M5, which combines instance-based learning with regression trees [14], as the
meta-model for SR. We chose this as it has a larger hypothesis space than simple linear
regression. In the experiments where the ensemble size was reduced, the initial
ensemble set had size N = 25 and was reduced to M = 10 with an accuracy_threshold of
0.66. Each of the DI techniques used distance-weighted 5-NN as its meta-model.
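
The random sub-space generation step can be sketched as follows. The subset size is
not stated in the text, so the value used here (half of the variables) is purely an
assumption, as are the function names and the fixed seed.

```python
import random
from typing import List, Sequence


def random_subspaces(n_features: int, n_models: int, subset_size: int,
                     seed: int = 0) -> List[List[int]]:
    """Draw one random subset of variable indices per base model."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_models)]


def project(instance: Sequence[float], subset: List[int]) -> List[float]:
    """Keep only the variables of the instance that belong to the subset."""
    return [instance[i] for i in subset]


# N = 25 base models as in the experiments; 8 variables and subsets of 4 are assumed.
subspaces = random_subspaces(n_features=8, n_models=25, subset_size=4)
reduced_instance = project([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], subspaces[0])
```

Each base model would be trained, and later queried, on instances projected onto its
own subset of variables.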



5 Experimental Results 

This section is divided into two subsections, each consisting of two experiments.
The first subsection concerns the accuracy of the ensembles for the whole ensemble set;
the second assesses the accuracy of the ensembles when they are reduced in size. Each
subsection presents the results of the comparison with the base models LR and 5-NN
respectively. The results of each experiment over the 15 data-sets are presented in the
form of a table where the first column gives the name of the data-set, the second
column the base model's MAE ± standard deviation for each data-set, and columns 3-6
give the MAE for each ensemble technique. The remaining column records the technique
with the least MAE if any of the techniques were able to significantly improve upon the
performance of the base model; otherwise the entry is left blank. An ensemble MAE
result which is significantly better than the base model is shown in bold; if it is
significantly worse it is shown underlined. An adjunct table summarizes the results of
the significance comparison in the form of wins/draws/losses, where wins is the number
of data-sets where the ensemble outperformed the base model, draws is the number of
data-sets for which the ensemble showed no significant difference in accuracy to the
base model, and losses is the number of data-sets where the ensemble accuracy was worse
than the base model.



5.1 Whole Ensemble Set 

This section refers to experimental results involving the whole ensemble set. Table 1
shows the results of the comparison when the base model was Linear Regression. DS
and DWS reduced the error significantly for the greatest number of data-sets, whereas
DW reduced the error for the least number. However, both SR and DS increased MAE
significantly for two data-sets. Only DWS never increased the MAE significantly for
any of the 15 data-sets. If one considers the “Least MAE” column, it is clear that for 7
data-sets none of the techniques were effective. For the other 8 data-sets, if we rank
the order in which the ensemble technique gave the least error most frequently, DS
came first with SR second.

Table 2 shows the results of the comparison of ensembles when the base model
was 5-NN. Clearly the two outstanding ensemble techniques were DW and DWS,
which both reduced the error significantly for 13 out of the 15 data-sets. The technique
which proved least effective was DS. The “Least MAE” column shows that for every
data-set at least one of the ensemble techniques was effective in significantly
reducing the error. DWS came first in rank order of the techniques which gave the least
error most frequently with DW coming second.

Table 1. The comparison of ensembles using LR as the base model (MAE ± standard deviation).

Data-set      LR               SR               DS               DW               DWS              Least MAE
abalone       1.58±0.08        1.61±0.28        1.58±0.09        1.62±0.11        1.58±0.10        -
autohorse     7.99±4.17        6.54±4.11        7.18±5.16        6.84±4.13        6.65±4.23        SR
autompg       2.23±0.21        2.11±0.48        2.05±0.26        2.16±0.21        2.05±0.25        DS/DWS
autoprice     1974.23±326.81   1659.33±290.68   1518.58±339.96   1660.29±367.28   1532.65±357.1    DS
auto93        3.79±1.3         4.11±1.6         4.02±1.17        3.20±1.23        3.25±1.28        -
bodyfat       0.53±0.23        0.53±0.22        0.43±0.26        0.60±0.21        0.48±0.24        DS
breastTumor   7.97±1.05        8.1±1.05         8.06±0.99        7.77±0.93        7.84±0.89        -
cholesterol   39.24±5.88       40.89±5.73       38.92±4.64       38.19±4.62       38.41±4.44       -
cloud         0.26±0.09        0.26±0.09        0.32±0.08        0.27±0.09        0.26±0.09        -
cpu           35.02±4.45       14.22±6.75       22.25±7.31       21.24±8.05       19.36±7.13       SR
housing       3.41±0.33        2.82±0.58        2.68±0.47        3.29±0.56        2.96±0.54        DS
lowbwt        364.48±48.21     392.01±57.4      397.93±51.48     356.87±62.64     363.03±61.44     -
sensory       0.61±0.04        0.61±0.04        0.59±0.05        0.61±0.06        0.59±0.06        -
servo         0.63±0.273       0.38±0.23        0.45±0.28        0.63±0.22        0.44±0.25        SR
strike        221.43±38.47     209.79±41.65     180.84±45.72     203.64±38.33     189.01±42.54     DS

Method              SR       DS       DW       DWS
Wins/Ties/Losses    6/7/2    8/5/2    4/9/2    8/7/0


Table 2. The comparison of ensembles using 5-NN as the base model (MAE ± standard deviation).

Data-set      5-NN             SR               DS               DW               DWS              Least MAE
abalone       1.61±0.09        1.54±0.08        1.73±0.07        1.54±0.09        1.54±0.09        SR/DW/DWS
autohorse     8.7±4.69         7.11±3.71        5.79±4.57        6.44±4.93        6.06±4.92        DS
autompg       2.31±0.38        2.12±0.34        2.41±0.35        2.04±0.35        2.08±0.35        DW
autoprice     1531.86±404.24   1478.62±460.89   1382.06±336.79   1438.39±460.48   1397.63±454.43   DWS
auto93        3.81±1.4         3.76±1.11        4.27±1.32        3.4±1.52         3.39±1.58        DWS
bodyfat       2.3±0.49         0.94±0.21        1.16±0.28        1.7±0.37         1.407±0.34       SR
breastTumor   9.39±1.04        8.38±0.64        9.67±1.06        8.01±0.91        8.12±0.97        DW
cholesterol   43.0±4.13        43.39±4.03       46.17±6.04       39.64±4.63       40.36±4.56       DW
cloud         0.51±0.19        0.38±0.13        0.39±0.11        0.39±0.17        0.36±0.14        DWS
cpu           22.72±13.94      34.16±17.61      23.97±12.97      19.68±13.46      20.72±14.18      DW
housing       2.59±0.58        2.30±0.41        2.56±0.39        2.39±0.55        2.27±0.5         DWS
lowbwt        398.3±80.6       397.8±47.17      471.35±67.56     365.88±80.53     369.47±74.71     DWS
sensory       0.6±0.06         0.55±0.06        0.66±0.07        0.58±0.05        0.58±0.05        SR
servo         0.56±0.19        0.38±0.30        0.42±0.24        0.62±0.22        0.42±0.22        SR
strike        194.62±53.46     222.29±46.7      196.71±50.16     182.25±50.01     176.08±50.15     DWS

Method               SR       DS       DW        DWS
Wins/Draws/Losses    9/3/3    4/7/4    13/2/0    13/2/0


In summary, it can be seen that for either base model, at least one of the DI
techniques is as effective as SR, if not more so, in reducing the error. Also, DWS
seemed to be the most reliable ensemble approach, as it never significantly increased
the error. The pattern of behaviour of the DI techniques for regression mirrors that of
classification [16], where the best integration method varied with the data-set and the
base model.



5.2 Reduced Ensemble Set 

In this section, we repeated the experiments of the previous section, but with the
ensemble set reduced at the end of the training phase from N = 25 to M = 10 using
the algorithm described in Figure 1. Table 3 shows the results of the comparison of
the reduced-size ensembles for LR. Comparing the wins/ties/losses of Table 3 to
Table 1 shows that DW and DS improved in performance, DWS remained the same and SR
remained approximately the same.



Table 3. Results of comparison of ensembles using LR (MAE ± standard deviation).

Data-set      LR               SR               DS               DW               DWS
abalone       1.58±0.08        1.52±0.06        1.57±0.09        1.60±0.09        1.58±0.09
autohorse     7.99±4.17        7.42±3.4         7.00±4.75        6.57±3.69        6.2±3.87
autompg       2.23±0.21        2.09±0.43        2.09±0.33        2.13±0.2         2.08±0.24
autoPrice     1974.23±326.81   1687.68±233.37   1550.61±343.32   1723.99±324.96   1567.55±356.56
auto93        3.79±1.3         3.521±1.1        3.91±1.24        3.43±1.33        3.41±1.39
bodyfat       0.53±0.23        0.48±0.23        0.41±0.27        0.45±0.25        0.42±0.26
breastTumor   7.97±1.05        8.08±1.09        8.06±0.99        7.92±0.95        7.97±0.89
cholesterol   39.24±5.88       38.94±5.76       39.01±4.96       39.05±4.63       39.03±4.49
cloud         0.26±0.09        0.28±0.09        0.28±0.10        0.27±0.10        0.27±0.10
cpu           35.02±4.45       15.35±6.8        24.27±6.01       25.05±7.2        23.07±7.09
housing       3.41±0.33        2.76±0.37        2.67±0.45        3.17±0.42        2.79±0.47
lowbwt        364.48±48.21     365.46±50.86     376.69±42.86     365.19±44.73     361.59±49.06
sensory       0.61±0.04        0.61±0.04        0.59±0.04        0.60±0.05        0.59±0.05
Servo         0.63±0.27        0.44±0.2         0.5±0.23         0.53±0.27        0.48±0.27
Strike        221.43±38.47     218.47±37.39     205.04±36.68     212.62±33.99     205.39±35.35

Method              SR        DS       DW       DWS
Wins/Ties/Losses    5/10/0    8/7/0    7/8/0    8/7/0


There is, however, more variation in the results than the summary of the significance
comparison alone would suggest. Table 4 shows the percentage change in MAE between
the results in Table 1 and Table 3, averaged over all data-sets. A positive value is
recorded if the technique gave on average a percentage reduction in error.

It is clear that although the average change in MAE is quite small, no larger than
2% in magnitude, the standard deviation is relatively large, indicating that for some
data-sets there is a large percentage change in the MAE.
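
As a worked illustration of this calculation, the sketch below computes the percentage
change for three of the SR entries from Table 1 and Table 3. Taking the whole-ensemble
MAE as the denominator and using the sample standard deviation are assumptions, since
the text does not state how the change was normalized; only three data-sets are used,
so the output is not expected to reproduce Table 4.

```python
from statistics import mean, stdev


def pct_reduction(mae_whole: float, mae_reduced: float) -> float:
    """Percentage reduction in MAE; positive when the reduced ensemble is better."""
    return 100.0 * (mae_whole - mae_reduced) / mae_whole


# SR with LR base models: (MAE from Table 1, MAE from Table 3).
sr_pairs = {
    "abalone": (1.61, 1.52),
    "cpu": (14.22, 15.35),
    "housing": (2.82, 2.76),
}
changes = {name: pct_reduction(whole, reduced)
           for name, (whole, reduced) in sr_pairs.items()}
average_change = mean(changes.values())
spread = stdev(changes.values())
```

Even in this small sample the individual changes differ in sign and size, which is the
kind of variation that a near-zero average with a large standard deviation conceals.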



Table 4. Percentage change in MAE for ensemble size from N to M.

Technique                           SR           DS            DW          DWS
Average percentage change in MAE    -0.45±8.3    -0.72±6.36    0.9±9.89    -1.41±7.41

However, comparing the reduced ensemble set to the whole ensemble results in detail
shows a general trend: for data-sets where the error increased, it did not increase
enough to change the level of significance, but where the error decreased, in some
cases it did change the significance comparison. Consider, for example, the technique
DW. For the whole ensemble set, autohorse, autoprice, cpu and strike gave an MAE
significantly better than the base model, whereas abalone and bodyfat were
significantly worse. For the reduced ensemble set, autohorse, autompg, autoprice,
bodyfat, cpu, servo and strike gave an MAE significantly better than the base model,
even though for some of these data-sets there was a relative increase in MAE.



Table 5. Comparison of ensembles with the base model 5-NN (MAE ± standard deviation).

Data-set      5-NN             SR               DS               DW               DWS
abalone       1.61±0.09        1.56±0.09        1.756±0.07       1.57±0.09        1.589±0.09
autohorse     8.7±4.69         6.14±3.71        4.88±4.92        4.76±4.39        4.77±4.72
autoMpg       2.31±0.38        2.13±0.34        2.34±0.39        2.00±0.39        2.06±0.41
autoPrice     1531.86±404.24   1383.66±460.89   1320.41±236.91   1313.78±462.28   1326.41±419.81
auto93        3.81±1.4         3.61±1.11        4.28±1.29        3.5±1.6          3.56±1.58
bodyfat       2.3±0.49         0.95±0.21        1.07±0.28        1.03±0.32        1.04±0.28
breastTumor   9.39±1.04        8.24±0.64        9.71±0.99        7.99±0.91        8.32±1.0
cholesterol   43.0±4.13        39.81±4.03       46.77±6.89       40.13±4.72       41.21±4.77
cloud         0.51±0.19        0.37±0.13        0.37±0.10        0.33±0.14        0.34±0.12
cpu           22.72±13.94      30.97±17.61      21.10±13.33      20.12±12.48      21.33±12.95
housing       2.59±0.58        2.33±0.41        2.57±0.5         2.33±0.48        2.24±0.41
lowbwt        398.3±80.6       375.46±47.17     430.79±83.64     359.39±58.65     367.66±64.31
sensory       0.6±0.06         0.56±0.06        0.64±0.08        0.57±0.05        0.58±0.06
servo         0.56±0.19        0.38±0.3         0.44±0.28        0.45±0.29        0.39±0.3
strike        194.62±53.46     208.72±46.7      195.78±61.88     187.14±52.46     185.92±56.27

Method              SR        DS       DW        DWS
Wins/Ties/Losses    10/3/2    4/9/3    14/1/0    9/6/0



Table 5 shows the comparison of the reduced ensemble sets when the base model
was 5-NN. Comparing the results to Table 2 shows that SR, DS and DW performed
slightly better with the reduced sets. DWS showed a drop of 4, from 13 to 9, in the
number of data-sets showing a significant improvement in MAE. However, for those 4
data-sets which were no longer significantly better with DWS than the base model, the
percentage change in MAE was at most 5.6%. The average percentage change in MAE
between the whole ensemble set and the reduced ensemble set for 5-NN is shown in
Table 6. The pattern of average error change is similar to that for LR, with a low
average percentage change but a higher level of variability in percentage change
amongst the data-sets. The main difference from the results for LR is that for all
techniques there was a positive change in the average percentage change in error, with
a relatively large change for DW.



Table 6. Percentage change in MAE for ensemble size from N to M.

Technique                           SR           DS           DW            DWS
Average percentage change in MAE    3.49±4.69    3.53±5.61    7.42±13.23    3.04±9.12

In summary, the ensemble size reduction strategy maintains the effectiveness of both
SR and the DI techniques. In the case of DW, the results suggest that pruning the
ensemble set actually improves accuracy, a likely consequence of DW being more
sensitive to inaccurate or redundant base models than the DS and DWS approaches,
which either select the best model or remove inaccurate models from the
model combination.



6 Conclusions and Future Work 

In this paper we have demonstrated that the classification ensemble techniques of
Dynamic Integration can be adapted to the problem of regression. We have shown
that for simple base models, these techniques are as effective as Stacked Regression
for the range of data-sets tested. We have presented an extension to the SR and DI
techniques which uses the accuracy and diversity measures captured in the training of
the base models to prune the size of the ensemble, thus removing models that are
ineffective in the model combination. We intend to refine and improve on this simple
technique as it adds little extra overhead to the algorithms and has shown promising
results in reducing the ensemble size whilst maintaining its level of accuracy. In
particular, we intend to investigate in more detail the appropriate choice of accuracy
threshold and the size of the reduced ensemble set. Also, we shall compare our measure
of diversity to the more commonly known measures of diversity, such as the
variance-based measure developed in [8].



Abstract. Despite the good results provided by Dynamic Classifier Selection
(DCS) mechanisms based on local accuracy in a large number of applications,
the performances are still capable of improvement. As the selection is performed
by computing the accuracy of each classifier in a neighbourhood of the
test pattern, performances depend on the shape and size of such a neighbourhood,
as well as the local density of the patterns. In this paper, we investigated
the use of neighbourhoods of adaptive shape and size to better cope with the
difficulties of a reliable estimation of local accuracies. Reported results show
that performance improvements can be achieved by suitably tuning some additional
parameters.



1 Introduction 

An alternative to using a single, highly optimised classifier is to “fuse” the outputs of 
multiple classifiers into a so-called multiple classifier system (MCS) [1-4]. If the 
classifiers make complementary decisions, the use of an MCS will improve the classi- 
fication accuracy with respect to that of individual classifiers. Two main strategies for 
classifier fusion have been proposed in the literature, namely classifier combination 
and classifier selection [5-8]. 

The rationale behind the classifier combination approach can be roughly traced 
back to the strategies for collective decisions. As a result, the final classification out- 
put depends on the output of the combined classifiers. As all the classifiers contribute 
to the classification label, these kinds of techniques implicitly assume that combined
classifiers are “competent” in the entire feature space.

On the other hand, the classifier selection approach assumes that each classifier has
a region of competence in some area of the feature space, i.e. it exhibits the lowest
error rate for patterns belonging to that region. Classifier selection is thus aimed at
assigning each classifier to its region of competence in the feature space, thus
exploiting the complementarities among classifiers. Two types of classifier selection
techniques have been proposed in the literature. One is called static classifier
selection, as the regions of competence are defined prior to classifying any test
pattern [5]. The other one is called dynamic classifier selection (DCS), as the regions
of competence are determined on the fly, depending on the test pattern to be classified
[7, 8]. In the DCS based on local accuracy, each test pattern x* is labelled by the
classifier with the highest accuracy computed in a “neighbourhood” of x*. In
particular, the k-nearest neighbour technique is used to estimate the accuracy of each
classifier in the neighbourhood of x*. These estimates can then be used to predict the
expertise of each classifier, and thus the related regions of competence.

While a number of experimental results showed the effectiveness of this approach
[5, 9], the performances of DCS are much lower than that of the oracle, i.e. of an
ideal selector that always selects the optimal classifier. Such a limitation may be the
consequence of some characteristics of the basic k-nn rule that have been thoroughly
studied in the past years [10]. In particular, we focussed on the choice of the most
suitable distance metric to compute the neighbourhood, and the choice of a suitable
value of k. The metric determines the shape of the neighbourhood, while the suitable
value of k is related to the local density of patterns.

In Section 2 we briefly recall the basic steps of the DCS algorithm in order to
highlight some critical points that motivated the present study. Section 3 is devoted
to the proposal of two modifications of the DCS algorithm, one related to the distance
measure used to compute the neighbourhood of the test pattern, the other related to the
dynamic choice of the most suitable value of k. We performed a number of experiments
on a hyperspectral remote-sensing data set, which are reported in Section 4. The
results show that the proposed modifications of the basic DCS algorithm allow for
improvements in accuracy. At the same time, they suggest that further study is needed
to bridge the gap between the accuracy of DCS and that of the oracle.



2 DCS Algorithm 

For each unknown test pattern x*, let us consider a local region ℜ_k(x*) of the feature
space defined in terms of the k nearest neighbours in the validation data. Validation
data are extracted from the training set and are not used for classifier training. In
order to define this k-nn region, the distances between x* and the patterns in the
validation data are calculated by a metric Z. As in the k-nearest neighbour classifier,
the appropriate size of the neighbourhood is decided by experiment.

It is easy to see that the classifier local accuracy (CLA) can be estimated as the
ratio between the number of patterns in the neighbourhood that were correctly classified
by the classifier C_j, and the number of patterns forming the neighbourhood of x* [7, 8].
We will use these estimates to predict which single classifier is most likely to be
correct for a given test pattern.

Let x* be the test pattern and N the number of classifiers.

1. If all the classifiers assign x* to the same class, then the pattern is assigned to
this class.

2. Compute CLA_j(x*), j = 1, ..., N (see below for details).

3. Identify the classifier C_m exhibiting the maximum value of CLA_j(x*).

4. If CLA_m(x*) > CLA_j(x*) for all j ≠ m, select C_m; else select the most globally
accurate classifier among those exhibiting the maximum value of CLA.
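
The four steps above can be sketched in Python as follows. The function name
dcs_select and its argument layout are hypothetical; the local accuracies are assumed
to have been computed beforehand, and the "most globally accurate" classifier is taken
to be the one with the highest accuracy on the whole validation set.

```python
from typing import Hashable, List


def dcs_select(predictions: List[Hashable], local_acc: List[float],
               global_acc: List[float]) -> Hashable:
    """Return the label assigned to the test pattern by the DCS rule.

    predictions[j] : class assigned to x* by classifier j
    local_acc[j]   : CLA_j(x*), the local accuracy of classifier j around x*
    global_acc[j]  : accuracy of classifier j on all validation data (assumed)
    """
    # Step 1: unanimous decisions need no selection.
    if len(set(predictions)) == 1:
        return predictions[0]
    # Steps 2-3: find the classifier(s) with the maximum local accuracy.
    best = max(local_acc)
    tied = [j for j, acc in enumerate(local_acc) if acc == best]
    # Step 4: a strict winner is selected directly; ties are broken globally.
    if len(tied) == 1:
        return predictions[tied[0]]
    winner = max(tied, key=lambda j: global_acc[j])
    return predictions[winner]


unanimous = dcs_select(["a", "a", "a"], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5])
strict = dcs_select(["a", "b", "c"], [0.6, 0.8, 0.4], [0.9, 0.7, 0.8])
tie_break = dcs_select(["a", "b", "c"], [0.8, 0.8, 0.4], [0.9, 0.7, 0.8])
```

Keeping the computation of CLA outside this function means that the same selection
rule can be used with either the “a priori” or the “a posteriori” estimate.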

In the following section, we will refer to this DCS algorithm in order to investigate
the effect on the classification performance of two design parameters, namely the
size of the neighbourhood k, and the distance metric Z used to define the k-nn
neighbourhood.

2.1 Local Accuracy “a priori”

The CLA “a priori” - a.k.a. “overall local accuracy” [7, 8] - is the percentage of
validation samples in the local region ℜ_k(x*) of x* that are correctly classified by
the classifier C_j. In this case, CLA is computed “a priori”, that is, without taking
into account the class assigned by the classifier C_j to the test pattern x*.

For each classifier C_j, the probability of classifying correctly the test pattern can
be computed as follows:

$$\hat{P}(\mathrm{correct}_j) = \frac{k_j}{k} \qquad (1)$$

where k_j is the number of “neighbouring” patterns that were correctly classified by the
classifier C_j and k is the number of patterns in the neighbourhood ℜ_k(x*).



2.2 Local Accuracy “a posteriori” 

The estimation of the CLA “a posteriori” - a.k.a. “local class accuracy” [7, 8] -
exploits the information on the class assigned by the classifier C_j to the test
pattern x*, that is, C_j(x*) = ω_i. Thus equation (1) can be rewritten as follows:

$$\hat{P}(\mathrm{correct}_j \mid C_j(x^*) = \omega_i) = \hat{P}(x^* \in \omega_i \mid C_j(x^*) = \omega_i) = \frac{k_{ij}}{n_{ij}} \qquad (2)$$

where k_ij is the number of neighbourhood patterns of x* that have been correctly
assigned by C_j to the class ω_i, and n_ij is the total number of neighbourhood
patterns that have been assigned to the class ω_i by C_j. In this case, CLA is computed
“a posteriori”, that is, after each classifier C_j produces the output on the test
pattern x*.
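
The two local accuracy estimates can be illustrated with a short Python sketch. The
function names and the toy labels are hypothetical, and returning 0.0 when the
classifier assigns none of the neighbours to the class in question is an assumption,
since the text does not cover that case.

```python
from typing import Hashable, List


def cla_a_priori(neigh_true: List[Hashable], neigh_pred: List[Hashable]) -> float:
    """Fraction of the k neighbours that the classifier labelled correctly."""
    k = len(neigh_true)
    k_j = sum(1 for t, p in zip(neigh_true, neigh_pred) if t == p)
    return k_j / k


def cla_a_posteriori(neigh_true: List[Hashable], neigh_pred: List[Hashable],
                     assigned: Hashable) -> float:
    """Accuracy restricted to neighbours given the class assigned to x*."""
    same_label = [i for i, p in enumerate(neigh_pred) if p == assigned]
    if not same_label:
        return 0.0  # assumed convention when no neighbour received this label
    correct = sum(1 for i in same_label if neigh_true[i] == assigned)
    return correct / len(same_label)


# True labels of the 5 nearest validation patterns and one classifier's outputs.
neigh_true = ["a", "a", "b", "b", "a"]
neigh_pred = ["a", "b", "b", "b", "a"]

prior = cla_a_priori(neigh_true, neigh_pred)
posterior_b = cla_a_posteriori(neigh_true, neigh_pred, "b")
posterior_a = cla_a_posteriori(neigh_true, neigh_pred, "a")
```

The example shows how the “a posteriori” estimate depends on the class assigned to the
test pattern, and that it is computed from fewer patterns than the “a priori” one.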

3 A Modified DCS Algorithm 

As CLA estimates are based on the k-nn rule, they are liable to the same limitations
that stimulated a number of improvements to the basic k-nn algorithm [10]. In the
present paper, we will focus our attention on two issues: i) the k-nn estimate assumes
that the local accuracy is almost constant in the local region ℜ_k(x*), otherwise the
k-nn estimate is only asymptotically correct; ii) the reliability of the local accuracy
estimate depends on the number of patterns included in the neighbourhood ℜ_k(x*).
These issues are influenced respectively by the shape and size of ℜ_k(x*). The shape
of the neighbourhood ℜ_k(x*) depends on the distance metric used, while its size
depends on the choice of the k parameter. It is easy to see that the size of ℜ_k(x*) -
for a given value of k - depends on the density of patterns surrounding x*.

Among the two DCS approaches summarised in Sections 2.1 and 2.2, the DCS
based on the “a posteriori” CLA may be affected by the above issues more deeply
than the “a priori” CLA, because the “a posteriori” CLA is computed on the subset of
patterns belonging to the neighbourhood of x* that received the same label assigned to
x*. On the other hand, reported experiments show that the “a posteriori” computation
of CLA provides higher accuracies than those provided by the “a priori” CLA [7, 8].
For these reasons, it is worth investigating the effect of an adaptive k-nn rule on the
performances of the “a posteriori” DCS in order to bridge the gap with respect to the
performances of the oracle. In particular, we propose two enhancements to the basic
DCS algorithm inspired by their counterparts in the k-nn literature, namely an
adaptive metric for neighbour computation, and a dynamic choice of the value of k.

In order to explain the rationale behind the proposed enhancements, it is worth
recalling that the goal of DCS by CLA is to sort out the “best” classifier in the
neighbourhood of x*. To this end, the feature space can be considered as divided by
each classifier into a number of “regions of competence” and a number of “regions of
lack of competence”. In the former the classifier makes the correct decision; in the
latter the classifier makes the wrong decision [8] (see Fig. 1).




Fig. 1. Test pattern x* and its 5-nn. (a) Use of the Euclidean metric. Patterns belonging to
different regions are included in the 5-nn. (b) Use of an ‘adaptive’ metric. Most of the patterns
included in the 5-nn belong to the same region to which x* belongs.

The DCS by local accuracy estimate (Sect. 2.2) assumes that it is possible to
estimate the accuracy of classifier C_j in x* through the accuracy of C_j in a particular
neighbourhood ℜ_k(x*) of x*. In order to ensure that this estimate is correct, it is
important that the patterns used in this estimate belong to the same region -
“competence” or “lack of competence” - as x*. Therefore the neighbourhood of x*
should be prevalently contained in the region to which x* belongs. In the k-nn rule for
classification, this is analogous to assuming that locally the posterior probabilities
are almost constant. Fig. 1 shows a simple example, where a boundary between regions of
competence is traced. Accordingly, the validation patterns are labelled as correctly
classified or incorrectly classified. The test pattern x* in Fig. 1(a) lies close to
the boundary of the region of competence. A k-nn neighbourhood ℜ_k(x*) using the
Euclidean metric and k=5 is shown. The computation of the CLA in x* may result in an
inaccurate estimate because it cannot be assumed that CLA is approximately constant in
ℜ_k(x*), as patterns in ℜ_k(x*) belong to a region of competence as well as a region of
lack of competence. A more accurate estimate of CLA in x* requires that ℜ_k(x*) is
contained in the region of competence, as shown in Fig. 1(b).

To attain more accurate estimates of CLA, we propose three techniques for dy- 
namically adjusting the shape and size of 9tj(x*) . 



3.1 DANN Metric for CLA Computation 

The DANN metric, proposed by Hastie and Tibshirani [11] for the k-nn classification
task, provides a solution to the above problem. This technique exploits the results of
a Linear Discriminant Analysis (LDA) performed on a set of K_m patterns (K_m > k) in a
nearest neighbourhood of x*. This result is used to reshape the nearest neighbourhood
ℜ_k(x*), so that class posterior probabilities are more homogeneous in the modified
neighbourhood. In order to use the DANN algorithm for CLA computation, the validation
patterns are labelled according to the classification result, that is, correct or
incorrect. In particular, we propose a modified DANN algorithm for CLA computation,
which can be summarised in the following steps:

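1. For each classifier C_j and for each validation pattern x_i, set the meta-label
“correctly classified” or “incorrectly classified”.

2. Spread out a nearest neighbourhood of K_m points around the test point x*.

3. Let I(x*, K_m, C_j) be the set of validation patterns in the neighbourhood that
satisfy C_j(x_i) = C_j(x*).

4. For the patterns x_i ∈ I(x*, K_m, C_j), compute the following quantities:

• the within-class covariance matrix

\[ W = \sum_{x_i \in \mathrm{correct}} w_i (x_i - \bar{x}_{\mathrm{correct}})(x_i - \bar{x}_{\mathrm{correct}})^T + \sum_{x_i \in \mathrm{incorrect}} w_i (x_i - \bar{x}_{\mathrm{incorrect}})(x_i - \bar{x}_{\mathrm{incorrect}})^T \]

• the between-class covariance matrix

\[ B = \pi_{\mathrm{correct}} (\bar{x}_{\mathrm{correct}} - \bar{x})(\bar{x}_{\mathrm{correct}} - \bar{x})^T + \pi_{\mathrm{incorrect}} (\bar{x}_{\mathrm{incorrect}} - \bar{x})(\bar{x}_{\mathrm{incorrect}} - \bar{x})^T \]

where:

- w_i is the weight assigned to the i-th observation, a function of the distance
d(x_i, x*),

- \bar{x}_{correct} and \bar{x}_{incorrect} are the weighted means of the patterns correctly and
incorrectly classified, respectively, and \bar{x} is the overall weighted mean,

- \pi_{correct} and \pi_{incorrect} are the corresponding weighted class proportions, as in [11].

5. Define a new metric \Sigma = W^{-1/2} [ W^{-1/2} B W^{-1/2} + \epsilon I ] W^{-1/2}, as in [11].

6. Use the metric Σ to calculate a new k-nearest neighbourhood and the adaptive
local classifier accuracy.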

The resulting metric Σ weighs the distances d²(x_i, x*) = (x_i - x*)^T Σ (x_i - x*) in
agreement with the local densities of correctly classified and incorrectly classified
patterns. As a result, ℜ_k(x*) is stretched in the directions of low variability of
CLA, and shrunk in the directions of high variability of CLA. Thus, ℜ_k(x*) should
be contained in a region with almost constant CLA. Further details on the DANN
algorithm can be found in [11].
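
The DANN-style computation for the two meta-classes “correct” and “incorrect” could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tri-cube weighting kernel, the default value of the softening parameter `eps`, the small ridge added to W, and the function names `dann_metric` and `metric_knn` are all hypothetical choices introduced here.

```python
import numpy as np

def dann_metric(X, correct, x_star, eps=1.0):
    """Sketch of a DANN-style metric from the K_m neighbours X, shape (n, d).

    correct: boolean meta-labels (True = correctly classified).
    eps: softening parameter of the metric (assumed default value).
    """
    X = np.asarray(X, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    d = X.shape[1]
    # Weights w_i as a function of the distance to x* (tri-cube kernel, assumed).
    dist = np.linalg.norm(X - x_star, axis=1)
    h = 1.01 * dist.max() + 1e-12      # bandwidth just beyond the farthest point
    w = (1.0 - (dist / h) ** 3) ** 3
    mean_all = (w[:, None] * X).sum(axis=0) / w.sum()
    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for mask in (correct, ~correct):
        if not mask.any():
            continue
        wc, Xc = w[mask], X[mask]
        mean_c = (wc[:, None] * Xc).sum(axis=0) / wc.sum()
        diff = Xc - mean_c
        W += (wc[:, None] * diff).T @ diff                  # within-class scatter
        delta = mean_c - mean_all
        B += (wc.sum() / w.sum()) * np.outer(delta, delta)  # between-class scatter
    # Small ridge (assumption) so that W is invertible.
    W += (1e-6 * np.trace(W) / d + 1e-12) * np.eye(d)
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + eps * np.eye(d)) @ W_inv_sqrt

def metric_knn(X, x_star, Sigma, k):
    """Indices of the k patterns nearest to x* under d^2 = (x - x*)^T Sigma (x - x*)."""
    diff = np.asarray(X, dtype=float) - x_star
    d2 = np.einsum("ij,jk,ik->i", diff, Sigma, diff)
    return np.argsort(d2)[:k]
```

The CLA would then be estimated on the patterns returned by `metric_knn` with the adapted metric rather than on the Euclidean neighbourhood.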

3.2 Simplified DANN for CLA Computation 

Given that the task at hand is a two-class problem, where the classes correspond to
correct or incorrect classification, we propose a simplified version of the DANN metric.
Our aim is to obtain a metric whose “unit sphere” is an ellipsoid whose minor axis
is approximately orthogonal to the boundary between regions of competence and
lack of competence. We can define a metric Σ with eigenvectors v_1, v_2, ..., v_d and
corresponding eigenvalues {α_i}, with α_1 > 1 and α_2 = α_3 = ... = α_d = 1. Let us
consider again a K_m (K_m > k) nearest neighbourhood of x*. The eigenvectors can be
constructed as an orthonormal basis such that v_1 lies in the direction linking the
barycentres of the “correctly classified” and “incorrectly classified” patterns. The
remaining eigenvectors are computed according to the Gram-Schmidt technique. In
particular, the proposed metric is built according to the following steps:

1. For each classifier C_j and for each validation pattern x_i, set the meta-label
“correctly classified” or “incorrectly classified”.

2. Spread out a nearest neighbourhood of K_m points around the test point x*.

3. Let I(x*, K_m, C_j) be the set of validation patterns in the K_m neighbourhood such
that C_j(x_i) = C_j(x*).

4. Let b_1 and b_2 be respectively the barycentre of the “correctly classified” patterns
and the barycentre of the “incorrectly classified” patterns in I(x*, K_m, C_j). Let v_1 be
the normalized vector v_1 = (b_1 - b_2) / ||b_1 - b_2||.

5. Define a new metric \Sigma = \alpha v_1 v_1^T + \sum_{i=2}^{d} v_i v_i^T, where

• {v_i}, i = 2...d are computed by the Gram-Schmidt technique, so that
{v_i}, i = 1...d is an orthonormal basis for the feature space

• α > 1 is the eigenvalue of the eigenvector v_1, the other eigenvalues being
equal to 1, so that the “unit sphere” is an ellipsoid whose minor axis is collinear
to v_1

6. Use the metric Σ to calculate a modified k-nearest neighbourhood.

We can choose α dynamically, α = a ||b_1 - b_2|| + b, where a and b are suitable
parameters, or we can use a constant value.
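
The steps above can be sketched compactly. Because the eigenvectors form an orthonormal basis and all eigenvalues except the first equal 1, the metric equals the identity plus (α - 1) times the outer product of v_1 with itself, so the Gram-Schmidt step need not be carried out explicitly. The function name and the fallback to the Euclidean metric when one meta-class is empty are assumptions made for this illustration.

```python
import numpy as np

def simplified_dann_metric(X, correct, alpha=None, a=1.0, b=1.0):
    """Sketch of the simplified metric from the K_m neighbours X, shape (n, d).

    correct: boolean meta-labels (True = correctly classified).
    alpha:   constant eigenvalue (> 1); if None it is chosen dynamically
             as a * ||b1 - b2|| + b.
    """
    X = np.asarray(X, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    d = X.shape[1]
    # Assumption: fall back to the Euclidean metric if a meta-class is empty.
    if correct.all() or not correct.any():
        return np.eye(d)
    b1 = X[correct].mean(axis=0)       # barycentre of correctly classified patterns
    b2 = X[~correct].mean(axis=0)      # barycentre of incorrectly classified patterns
    gap = np.linalg.norm(b1 - b2)
    if gap == 0.0:
        return np.eye(d)
    v1 = (b1 - b2) / gap
    if alpha is None:
        alpha = a * gap + b
    # alpha*v1*v1^T + sum_{i>=2} v_i*v_i^T equals I + (alpha - 1)*v1*v1^T
    return np.eye(d) + (alpha - 1.0) * np.outer(v1, v1)
```

Distances measured with this matrix are inflated along v_1, which shrinks the neighbourhood across the boundary between the two regions.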

3.3 Dynamic Choice of k 

In the “a posteriori” method it is possible that, for a given classifier C_j and for a given
k, there is no validation pattern x_i assigned to class C_j(x*), that is, to the same class
assigned to the test pattern. In this case, it is not possible to calculate the CLA using
Equation (2). In order to avoid this problem, we propose to choose k dynamically as
follows:




180 Luca Didaci and Giorgio Giacinto 

1. Let N be the number of classifiers, and let k_0 be the initial value of k.

2. Compute the CLA “a posteriori” for all classifiers; let N_0 be the number of
classifiers for which it is not possible to compute the CLA according to Equation (2).

3. If N_0 < N/2 or k > k_max, set to zero the CLA for those classifiers; otherwise
set k = k + k_step and iterate steps 2 and 3.

In this way, we choose a k ≥ k_0 such that the number of classifiers for which it is not
possible to calculate the CLA is less than N/2. Typical values of k_step are in the
range from 1 to 14.
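
The iteration can be written as a short loop. In this sketch the CLA computation itself is abstracted into a caller-supplied function `cla_at_k` (a hypothetical name) that returns one value per classifier, with `None` standing for a classifier whose CLA cannot be computed at the given k.

```python
from typing import Callable, List, Optional, Tuple

def dynamic_k_cla(cla_at_k: Callable[[int], List[Optional[float]]],
                  k0: int, k_step: int, k_max: int) -> Tuple[int, List[float]]:
    """Grow k until fewer than half of the classifiers lack a CLA estimate.

    cla_at_k(k) returns the "a posteriori" CLA of every classifier for a
    neighbourhood of size k, or None where it cannot be computed.
    """
    k = k0
    while True:
        cla = cla_at_k(k)
        n_missing = sum(value is None for value in cla)
        if n_missing < len(cla) / 2 or k > k_max:
            # Classifiers without an estimate get CLA = 0.
            return k, [0.0 if value is None else value for value in cla]
        k += k_step
```

The loop always terminates because k increases by k_step at every iteration and stops once it exceeds k_max.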



4 Experimental Results 

In order to test the performance of the above techniques, we used a hyperspectral
remote sensing data set, extracted from an AVIRIS image taken over NW Indiana’s
Indian Pine test site in June 1992 [12]. This data set contains 9345 patterns subdivided
into 9 spectral classes. Each pattern is described by a feature vector of 202 components.
This is an interesting application for testing the proposed DCS technique for two main
reasons:

- the available data allowed us to extract a validation set (2885 patterns), a training set
(3266 patterns), and a test set (3194 patterns) of comparable sizes, so that all the
spectral classes are represented in each set;

- the high dimensionality of the data suggests the use of an MCS made up of classifiers
trained on different subspaces.

For the task at hand, we divided the feature set into five subspaces, the i-th subspace
containing the features i, i+5, i+10, etc. This subdivision ensures that each subspace
contains a number of uncorrelated features, as adjacent channels exhibit high
correlation.
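
As an illustration, the interleaved subdivision of the feature indices can be generated as below. The function name is hypothetical and indices are 0-based, so subspace 0 holds features 0, 5, 10, and so on.

```python
def interleaved_subspaces(n_features, n_subspaces=5):
    """Return one list of feature indices per subspace.

    Subspace i contains features i, i + n_subspaces, i + 2 * n_subspaces, ...
    so that adjacent (highly correlated) channels fall into different subspaces.
    """
    return [list(range(i, n_features, n_subspaces)) for i in range(n_subspaces)]
```

With 202 features and five subspaces, two subspaces receive 41 features and the remaining three receive 40.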

For each subspace we trained a k-nn classifier using k = 10. It is worth noting that
any other base classifier can be used to test the performance of DCS.

In the following tables, results attained on the test set are reported. In order to
emphasize the effects of the proposed DCS modifications, we also report the results
attained on a reduced test set (797 patterns), that is, those patterns for which at least
one classifier gives the correct classification result and at least one classifier provides
an incorrect classification. In other words, the reduced test set includes only those
patterns for which a selection mechanism is needed.
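
The reduced test set and the oracle accuracy can both be derived from the matrix of classifier decisions. The following is a minimal sketch; the function names and the array layout (one row per classifier) are assumptions made for the example.

```python
import numpy as np

def reduced_test_mask(predictions, y_true):
    """Mask of patterns where classifiers disagree about correctness.

    predictions: array of shape (n_classifiers, n_patterns) with predicted labels.
    y_true:      array of shape (n_patterns,) with the true labels.
    A pattern is kept if at least one classifier is right and at least one is wrong.
    """
    hits = np.asarray(predictions) == np.asarray(y_true)[None, :]
    return hits.any(axis=0) & ~hits.all(axis=0)

def oracle_accuracy(predictions, y_true):
    """Fraction of patterns correctly classified by at least one classifier."""
    hits = np.asarray(predictions) == np.asarray(y_true)[None, :]
    return float(hits.any(axis=0).mean())
```

By construction the oracle is always correct on the reduced test set, which is why its accuracy there is 100%.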

Table 1 shows the overall accuracy provided by each classifier on the test set and
on the reduced test set. As each classifier is only about 50% correct on the reduced test
set, it is worth using a selection mechanism to improve the classification accuracy.



Table 1. Overall accuracy on the complete test set and on the “reduced” test set.

                     C1       C2       C3       C4       C5       Oracle
Complete test set    70.79%   70.79%   72.38%   71.65%   69.97%   82.97%
Reduced test set     51.19%   51.19%   57.59%   54.96%   47.93%   100%

Table 2 summarises the seven experiments that have been carried out in order to
make a thorough evaluation of the proposed mechanisms by varying the design
parameters. In particular, experiments 1 to 4 are aimed at comparing the performance
of DCS when different Minkowski metrics are used. For all the experiments, we used
different values of the parameters k, K_m, α, in order to evaluate the sensitivity of the
accuracy.

Table 2. List of experiments.

Name   Metric                   Parameters                             Note
EXP1   Euclidean metric
EXP2   Euclidean metric         k_step from 1 to 14                    Dynamic choice of k (Sect. 3.3)
EXP3   City block distance
EXP4   City block distance      k_step from 1 to 14                    Dynamic choice of k (Sect. 3.3)
EXP5   Simplified DANN metric   K_m from 51 to 324; α from 4 to 100    Sect. 3.2
EXP6   Simplified DANN metric   K_m from 51 to 324                     Dynamic choice of α (Sect. 3.2)
EXP7   DANN metric              K_m from 51 to 324                     Sect. 3.1

Common parameters: k from 1 to 51


Table 3 shows the best overall accuracy of the proposed mechanisms for the 7 ex- 
periment settings. For the sake of comparison, the performances of the best individual 
classifier and the “oracle” are also shown. 

Table 3. Best overall accuracy on the entire test set and on the reduced test set. The difference
with respect to the best classifier is also reported.

                  Entire Test Set          Reduced Test Set
                  Accuracy   Δ accuracy    Accuracy   Δ accuracy    Best parameters
Best classifier   72.38%     0             57.59%     0
EXP1              74.11%     1.73%         64.49%     6.90%         k=12
EXP2              74.74%     2.36%         67.00%     9.41%         k=3; k_step=9
EXP3              75.27%     2.89%         69.13%     11.54%        k=10
EXP4              75.52%     3.14%         70.14%     12.55%        k=10; k_step=10, 14
EXP5              75.02%     2.64%         68.13%     10.54%        k=12, K_m=51, α=50
EXP6              74.89%     2.51%         67.63%     10.04%        k=9, K_m=51
EXP7              75.48%     3.10%         70.01%     12.42%        k=16, K_m=324
Oracle            82.97%     10.59%        100%       42.41%


We can notice that an accurate choice of the DCS parameters allows outperforming
the best classifier of the MCS. In addition, the proposed modifications of the DCS
mechanism allow outperforming the basic DCS algorithm (EXP1). In particular, the
highest accuracies have been attained in EXP4 and EXP7, where the size (EXP4) and
the shape (EXP7) of the neighbourhood have been dynamically chosen.

In order to evaluate the distribution of the results with respect to the choice of the
design parameters, Table 4 shows the maximum, mean, minimum, and standard
deviation of the accuracy attained in a number of experiments. In particular, Table 4
shows the results attained using either a fixed value of k = 12, or a varying value of k
in the range [1, 51]. The value k = 12 was the best value for the DCS based
on the Euclidean metric. As would be expected, when the value of k is fixed, EXP2
and EXP4 exhibit the lowest values of standard deviation, as the only design parameter
is k_step. On the other hand, when the value of k is not fixed, the use of an adaptive
metric allows attaining smaller values of standard deviation.

Table 4. Maximum, mean, minimum and standard deviation of the accuracy on the reduced test
set for varying values of the parameters.

                                              k=12                                k=1..51
                  Parameters                  Max      Mean     Min      Std dev  Max      Mean     Min      Std dev
Best classifier                               57.59%   57.59%   57.59%   0        57.59%   57.59%   57.59%   0
EXP1                                          64.49%   64.49%   64.49%   0        64.49%   62.85%   60.23%   0.753
EXP2              k_step=1..14                65.12%   64.82%   64.62%   0.175    67.00%   63.14%   61.61%   0.95
EXP3                                          67.63%   67.63%   67.63%   0        69.13%   65.66%   62.74%   1.584
EXP4              k_step=1..14                68.38%   68.29%   68.13%   0.077    70.14%   66.02%   63.31%   1.91
EXP5              K_m=51..324; α=4..100       68.13%   65.54%   64.24%   1.005    68.13%   64.66%   58.85%   1.225
                  K_m=51 (best); α=4..100     68.13%   66.66%   65.12%   1.256    68.13%   65.07%   64.60%   1.292
                  K_m=51..324; α=50 (best)    68.13%   65.84%   64.24%   1.688    68.13%   64.82%   59.72%   1.331
EXP6              K_m=51..324                 67.50%   65.46%   64.24%   1.41     67.63%   64.61%   58.97%   1.399
                  K_m=51 (best)               67.50%   67.50%   67.50%   0        67.63%   65.14%   61.36%   1.498
EXP7              K_m=51..324                 69.76%   67.97%   65.75%   1.75     70.01%   67.02%   59.85%   1.605
                  K_m=324 (best)              69.76%   69.76%   69.76%   0        70.01%   67.97%   62.48%   1.289

Table 4 also shows a number of experiment settings aimed at evaluating the influence
of the different design parameters on the performance. It can be seen that the
simplified DANN metric (EXP5 and EXP6) exhibits smaller values of the standard
deviation than the original DANN metric. Thus, the DANN metric can outperform the
other considered metrics thanks to a fine-tuning of the design parameters. On the other
hand, the performance of the simplified DANN metric is less sensitive to a non-optimal
choice of the values of the design parameters.



Table 5. Accuracy on the entire test set attained with the optimal parameters estimated from the
validation set, and the optimal parameters estimated from the test set.

                  Parameters estimated                       Parameters estimated
                  from the validation set   Accuracy         from the test set       Accuracy
Best classifier   -                         72.38%           -                       72.38%
EXP1              k=19                      73.64%           k=12                    74.11%
EXP2              k=19; k_step=1-14         73.70%           k=3; k_step=9           74.74%
EXP3              k=6                       74.99%           k=10                    75.27%
EXP4              k=6; k_step=14            75.36%           k=10; k_step=10, 14     75.52%
EXP5              k=41, K_m=175, α=50       73.82%           k=12, K_m=51, α=50      75.02%
                  k=47, K_m=175, α=20       73.92%
EXP6              k=35, 36, K_m=175         74.15%           k=9, K_m=51             74.89%
EXP7              k=8, 10, K_m=175          74.86%; 75.05%   k=16, K_m=324           75.48%


It can be seen that the use of an adaptive metric improves on the accuracy of a
fixed metric, thus confirming the influence of the shape and size of the neighbourhood
on the computation of the CLA. In addition, the values of the standard deviation
show that a sub-optimal choice of the design parameters is still capable of providing
improved results. As an example, such a sub-optimal choice can be obtained by
estimating the optimal parameter values from the validation set. Table 5 shows that this
choice of the parameters attains performance higher than that provided by
the “standard” DCS algorithm. In addition, these results are quite close to those
obtained with the optimal choice of the parameters estimated directly from the test set.
Thus, it can be concluded that the use of an adaptive metric can improve the accuracy
of the DCS based on local accuracy estimates.

Abstract. Diversity among base classifiers is known to be a necessary condition
for improved performance of a classifier ensemble. However, there is an
inevitable trade-off between accuracy and diversity, which is known as the
accuracy/diversity dilemma. In this paper, accuracy and diversity are incorporated
into a single measure that is based on a spectral representation and computed
between pairs of patterns of different class. Although the technique is only
applicable to two-class problems, it is extended here to multi-class problems through
Output Coding, and a comparison is made between various weighted decoding schemes.



1 Introduction 

The idea of combining multiple classifiers is based on the observation that achieving 
optimal performance in combination is not necessarily consistent with obtaining the 
best performance for a single classifier. Although it is known that diversity among 
base classifiers is a necessary condition for improvement in ensemble performance, 
there is no general agreement about how to quantify the notion of diversity among a 
set of classifiers. The desirability of using negatively correlated base classifiers in an 
ensemble is generally recognised, and in [1] the relationship between Diversity and 
majority vote accuracy is characterized as classifier dependency is systematically 
changed. Diversity Measures can be categorised into pair-wise and non-pair-wise, but 
to apply pair-wise measures to finding overall diversity it is necessary to average over 
the classifier set. These pair-wise diversity measures are normally computed between 
pairs of classifiers independent of target labels. As explained in [2], the accuracy- 
diversity dilemma arises because when base classifiers become very accurate their 
diversity must decrease, so that it is expected that there will be a trade-off. 

A possible way around this dilemma is to incorporate diversity and accuracy within
a single measure, as suggested in this paper. The measure is based on a spectral
representation that was first proposed for two-class problems in [3], and later developed
in the context of Multiple Classifier Systems in [4]. It was shown for two-class problems
in [5] that over-fitting of the training set could be detected by observing the spectral
measure as it varies with base classifier complexity. Since realistic learning problems
are in general ill-posed [6], it is known that any attempt to automate the learning task
must make some assumptions. In this paper, in comparison with [5], one of the
assumptions is relaxed (equ. (3) and equ. (5) in Section 2), and the only assumption
required is that a suitable choice be made for the range over which base classifier
complexity is varied. The technique is applied to multi-class problems through Output
Coding (Section 3), and a comparison of various weighted decoding schemes is included.



2 Diversity/Accuracy over Pattern Pairs

Consider the following MCS framework in which a two-class supervised learning
problem is solved by B parallel base classifiers whose outputs are combined by voting
or summation. It is assumed that there are μ training patterns with the label given to
each pattern denoted by ω_m = f(X_m), where m = 1, ..., μ. The mth pattern may then be
represented by the B-dimensional vector formed from the (real-valued) base classifier
outputs given by

\[ X_m = (\xi_{m1}, \xi_{m2}, \ldots, \xi_{mB}), \quad \xi_{mi} \in [0,1], \; \omega_m \in \{0,1\}, \; i = 1 \ldots B \]   (1)

If one of two classes is assigned by each of B base classifiers, the decisions may be
treated as binary features. Then the mth pattern in (1) can be represented as a vertex
in the B-dimensional binary hypercube, resulting in a binary-to-binary mapping

\[ X_m = (x_{m1}, x_{m2}, \ldots, x_{mB}), \quad x_{mi} \text{ and } \omega_m \in \{0,1\}, \; i = 1 \ldots B \]   (2)



In [3], a spectral representation of f(X) is proposed for characterising the mapping
in equ. (2). A well known property of the transforms that characterise these mappings
(e.g. Rademacher-Walsh transform) is that the first order coefficients represent the
correlation between f(X) and x [7]. In [5] an estimate of first order coefficients, based
on Hamming Distance between pairs of patterns of different class, is proposed for
noisy, incompletely specified and perhaps contradictory patterns. The mth
pattern component is assigned (j = 1, 2, ..., B)

\[ \sigma^{c}_{mj} = \sum_{n=1}^{\mu} \frac{x_{mj} \oplus x_{nj}}{D_{mn}}, \quad f(X_m) \neq f(X_n) \]   (3)

where c = + if x_mj = f(X_m), c = - if x_mj ≠ f(X_m), ⊕ is logic XOR, and D_mn is the
Hamming distance between X_m and X_n.

Examples of applying (3) to simple Boolean functions are given in [3]. The jth
component x_mj of a pattern pair has associated σ⁻_mj only if the jth base classifier
misclassifies both patterns. Therefore we expect that a pattern with relatively large σ⁻_mj
is likely to come from regions where the two classes overlap. The measure for the nth
pattern looks at the relative difference between excitatory and inhibitory contributions,
normalised so that -1 ≤ σ'_n ≤ 1, defined by



1 * 



"CT 7=1 



^ ^ ^mj 



n—1 



7=1 



"7 _| «7 

U U 



(4) 
In [4] plots of σ'_n are given and interpreted as a measure of class separability,
indicating how well the nth pattern is separated from patterns of the other class by the set
of B classifiers. It may be compared with the Margin (M) for the nth pattern, which
represents the confidence of classification. Cumulative Distribution graphs can be
defined for σ' similar to Margins, that is g(σ') versus σ', where g(σ') is the fraction
of patterns with value at least σ'. Areas under the distribution are compared in [5], but
here we plot the mean over positive σ' and M

\[ \bar{\tau} = \frac{1}{\mu} \sum_{n=1}^{\mu} \tau_n, \quad \tau_n > 0, \quad \tau \in \{\sigma', \sigma'^{(H)}, M\} \]   (5)

where σ'^(H) is a variant of σ' that sets D_mn to 1 in (3), that is, removing the
assumption that the contribution between pattern pairs is inversely proportional to D_mn.

Pair-wise diversity measures, such as Q statistic, correlation coefficient ρ, Double
Fault F and Agreement A (1 - Disagreement) measures [1], are computed over classifier
pairs, using four counts, defined between ith and jth classifiers

\[ N^{ab}_{ij} = \sum_{m=1}^{\mu} y^{a}_{mi} \wedge y^{b}_{mj}, \quad a, b \in \{0,1\}, \quad y^{1} = y, \; y^{0} = \bar{y} \]   (6)

where ∧ is logical AND, and y_mj = 1 if x_mj = f(X_m).

For conventional diversity measures the mean is taken over B base classifiers

\[ \bar{\Delta} = \frac{2}{B(B-1)} \sum_{i=1}^{B-1} \sum_{j=i+1}^{B} \Delta_{ij}, \quad \Delta \in \{Q, \rho, A, F\} \]   (7)

where Δ_ij represents the Diversity Measure between ith and jth classifiers.
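
For illustration, the four counts and some of the pair-wise measures can be computed from an “oracle” matrix whose entry (m, j) is true when the jth classifier labels the mth pattern correctly. The sketch below uses the commonly quoted definitions of the Q statistic, Double Fault and Agreement, which are assumed here to match those of [1]; all function names are hypothetical.

```python
import numpy as np

def pairwise_counts(y_i, y_j):
    """Counts N11, N10, N01, N00 for two classifiers.

    y_i, y_j: boolean arrays, True where the classifier is correct on a pattern.
    """
    y_i = np.asarray(y_i, dtype=bool)
    y_j = np.asarray(y_j, dtype=bool)
    n11 = int(np.sum(y_i & y_j))
    n10 = int(np.sum(y_i & ~y_j))
    n01 = int(np.sum(~y_i & y_j))
    n00 = int(np.sum(~y_i & ~y_j))
    return n11, n10, n01, n00

def q_statistic(y_i, y_j):
    n11, n10, n01, n00 = pairwise_counts(y_i, y_j)
    den = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / den if den else 0.0

def double_fault(y_i, y_j):
    n11, n10, n01, n00 = pairwise_counts(y_i, y_j)
    return n00 / (n11 + n10 + n01 + n00)

def agreement(y_i, y_j):
    n11, n10, n01, n00 = pairwise_counts(y_i, y_j)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def mean_pairwise(Y, measure):
    """Average a pair-wise measure over all B(B-1)/2 classifier pairs.

    Y: boolean array of shape (n_patterns, B).
    """
    Y = np.asarray(Y, dtype=bool)
    B = Y.shape[1]
    values = [measure(Y[:, i], Y[:, j]) for i in range(B - 1) for j in range(i + 1, B)]
    return sum(values) / len(values)
```

Averaging over the B(B-1)/2 unordered pairs is equivalent to the normalisation 2/(B(B-1)) applied to the double sum over i < j.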

For comparison, we also calculate mean Diversity Measures over patterns of dif- 
ferent class (A ') using counts as follows 

Nl = * fix,) (8) 

M 

from which 7^ /(XJ (9) 

fn=\ n=\ 

Note that equ. (4) may be re-arranged to express in the notation used in (8) 

since ^ ^ “ and ^ ^ “ 

7=1 «=1 7=1 n=l 



3 Output Coding and Multi-class Problems 

The idea of using codes is based on modelling the multi-class learning task as a
communication problem in which class information is transmitted over a channel. The k x
B code word matrix Z has one row (code word) for each of k classes, with each column
defining one of B sub-problems that use a different labelling. Assuming each
element of Z is a binary variable x, a training pattern with target class ω_l (l = 1...k) is
re-labelled as class Ω_1 if Z_lj = x and as class Ω_2 if Z_lj = \bar{x}. The two super-classes Ω_1
and Ω_2 represent, for each column, a different decomposition of the original problem.
In the test phase the pth pattern is assigned to the class ω_i that is represented by the
closest code word, where distance of the pth pattern to the ith code word is defined as

\[ D_{pi} = \sum_{j=1}^{B} \alpha_{ji} \, | Z_{ij} - v_{pj} | \]   (10)


where α_ji allows for the ith class and jth classifier to be assigned a different weight.
Hamming decoding is denoted in equ. (10) by {α_ji = 1, v_pj = x_pj, equ. (2)} and L¹ norm
decoding by {α_ji = 1, v_pj = ξ_pj, equ. (1)}. In addition to the Bayes error, errors due to
individual classifiers and due to the combining strategy can be distinguished. This can be
further broken down into errors due to sub-optimal decomposition and errors due to
the distance-based decision rule. If it is assumed that each classifier provides exactly
the probability of respective super-class membership, with posterior probability of the lth
class represented by q_l (l = 1 ... k), from equation (10), assuming α_ji = 1, it is shown in
[8] that

\[ D_{pi} = \sum_{j=1}^{B} \Big| \sum_{l=1}^{k} q_l Z_{lj} - Z_{ij} \Big| = (1 - q_i) \sum_{j=1}^{B} | Z_{ij} - Z_{lj} | \]   (11)




Equation (11) tells us that D_pi is the product of (1 - q_i) and the Hamming Distance
between code words, so that when all pairs of code words are equidistant, minimising
D_pi implies maximising posterior probability, which is equivalent to Bayes rule.
Therefore any variation in Hamming distance between pairs of code words will reduce
the effectiveness of the combining strategy. Further criteria governing choice of
codes are discussed in [8], but finding optimal code matrices satisfying the criteria is a
complex problem, and longer random codes have frequently been employed with
almost as good performance. Many types of decoding are possible, but theoretical and
experimental evidence indicates that, provided a problem-independent code is long
enough, performance is not much affected. There is recent interest in adaptive,
problem-dependent and non-binary codes [9], but in this paper a random code matrix with
a near-equal split of classes (approximately equal number of 1’s in each column) is
chosen [10].
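
The distance-based decoding rule described in this section can be sketched as follows. With hard (0/1) classifier decisions and unit weights it reduces to Hamming decoding, and with real-valued outputs in [0,1] and unit weights to L¹ norm decoding. The function name and the (k, B) layout of the optional weight array are assumptions of this example.

```python
import numpy as np

def decode(Z, V, weights=None):
    """Assign each pattern to the class with the closest code word.

    Z:       (k, B) binary code word matrix, one row per class.
    V:       (P, B) base classifier outputs, hard decisions or values in [0, 1].
    weights: optional (k, B) array, one weight per class and classifier;
             defaults to 1 everywhere (unweighted decoding).
    Returns the predicted class indices, shape (P,), and the distances, shape (P, k).
    """
    Z = np.asarray(Z, dtype=float)
    V = np.asarray(V, dtype=float)
    if weights is None:
        weights = np.ones_like(Z)
    # D[p, i] = sum_j weights[i, j] * |Z[i, j] - V[p, j]|
    D = (weights[None, :, :] * np.abs(Z[None, :, :] - V[:, None, :])).sum(axis=2)
    return D.argmin(axis=1), D
```

A fixed weighting scheme only changes the `weights` argument; the nearest-code-word rule itself stays the same.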

When base classifiers are sub-optimal and vary in accuracy (unbalanced [11]), it
may be that the decoding strategy can assist in reducing error, and in Section 5 various
weighted decoding schemes are compared. All weights proposed in this study are
fixed in the sense that none change as a function of the particular pattern being
classified, which is sometimes referred to as implicit data-dependence or constant
weighting. It is generally recognized that a weighted combination may in principle be
superior, but it is not easy to estimate the weights. Fixed weighting coefficients (α_jl,
j = 1...B, l = 1...k), estimated from training data, are proposed as follows
a



a' 



1 A M 

= — (Y (j\-y CT^.) 

^ mj ^ mj-' 

^ Ct' m=l m=l 



( 12 ) 



a 



<y'(H) _ 



A A 



K 






,<T m=l n=\ 



(13) 



='^--^'L'L^^ymJ ^ynj)-(yn,j ^ynj)} ' f m) ^ f n) (14) 

^ m— 1 n—\ 

where y_^. is defined in equ. (6) 

These weighting functions are intended to give more weight to classifiers that sepa- 
rate the two classes well, in contrast to the following weighting, which is based on the 
Agreement measure 

1 ^ 

«;=— (15) 

-^,4 !=1 

Normalization constants K in equ. (12) to (15) are chosen so that weights sum to 1,
and negative weights in (12) and (13) are set to zero. Note that decodings in (12) and
(13) have different weights for each class, which come by computing σ'^(H) and σ' from
the k binary-to-binary mappings defined by one-vs-rest coding with respect to base
classifier output decisions.



4 Experimental Evidence 

Multi-class benchmark problems have been selected from [12], and the experiments
use random 50/50 or 20/80 training/testing splits. The datasets, including number of
patterns, are Segment (2310), Iris (150), Ecoli (336), Yeast (1484), Vehicle (846),
Vowel (990). All experiments are performed with one hundred single hidden-layer
MLP base classifiers, except where the number of classifiers is varied. The training
algorithm is Levenberg-Marquardt with default parameters, and the number
of training epochs is systematically varied for different runs of the MCS. Random
perturbation of the MLP base classifiers is caused by different starting weights on
each run, and each point on the graphs is the mean over ten runs.

For the datasets tested in this paper, none over-fitted for the range of base classifier
complexity values considered. To encourage over-fitting, the experiments were carried
out with varying classification noise, in which a percentage of patterns of each class
were selected at random (without replacement), and each target label changed to a
class chosen at random (from patterns of the remaining classes). Figure 1 gives test
and train error rates for Segment 50/50 + 20% classification noise with [2,4,8,16]
hidden nodes, and Figure 2 shows various measures defined in equ. (5), (7) and (9). Figure
3 shows test error, margin M and σ'^(H) (equ. (5)) for Iris 50/50 at 8 nodes with
[0,10,20,30,40]% classification noise. By comparing Figure 1 and Figure 2 at 16
nodes, and observing Figure 3, it appears that σ'^(H) may be a good predictor of test
error and may detect over-fitting. Note also in Figure 3, for more than 27 epochs, that
over-fitting reaches the point where σ'^(H) starts to increase. This is not unexpected
since, as the number of nodes and epochs becomes very large and all patterns become
correctly classified, σ'^(H) → 1.
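
The classification noise procedure described above can be sketched as follows. The reading that each replacement label is drawn from a randomly chosen pattern of the remaining classes (so that frequent classes are drawn more often) is an interpretation of the text, and the function name is hypothetical.

```python
import numpy as np

def add_classification_noise(y, fraction, rng):
    """Flip the labels of a fraction of the patterns of each class.

    y:        integer class labels, shape (n,).
    fraction: proportion of patterns per class whose label is changed.
    rng:      numpy random Generator, e.g. np.random.default_rng(0).
    """
    y = np.asarray(y)
    noisy = y.copy()
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        others = np.flatnonzero(y != c)
        n_flip = int(round(fraction * idx.size))
        if n_flip == 0 or others.size == 0:
            continue
        # Patterns of class c selected without replacement.
        chosen = rng.choice(idx, size=n_flip, replace=False)
        # New labels copied from randomly drawn patterns of the remaining classes.
        noisy[chosen] = y[rng.choice(others, size=n_flip)]
    return noisy
```

Because every new label comes from a pattern of a different class, each selected pattern is guaranteed to end up mislabelled.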

Fig. 1. Error rates for Segment 50/50 + 20% noise with [2,4,8,16] nodes. (Panels: Mean Base
Test (2 class); Hamming Decoding Test.)

Fig. 2. Measures for Segment 50/50 + 20% noise for [2,4,8,16] nodes. (Horizontal axis:
Number of Epochs.)



Table 1 shows mean correlation coefficient, over all 50/50 and 20/80 datasets + 20%
noise, of test error with respect to [1-69] epochs over [2,4,8,16] nodes. The correlation
is shown for base classifier, Hamming and L¹ norm decoding (defined in equ. (10))
and demonstrates that σ'^(H) is well correlated with base classifier test error. However,
most measures appear well correlated, and the experiments were repeated for the
restricted range of epochs 2-20 at 2 epoch intervals, with varying noise. Table 2
shows correlation coefficient for 50/50 datasets of base classifier test error at 8 nodes
averaged over [0,10,20,30,40]% noise. Table 3 has the same data as Table 2 but for
20/80 datasets, and Table 4 shows mean correlation coefficient, over all 50/50 and
20/80 datasets, of test error with respect to [2-20] epochs at 8 nodes averaged over
[0,10,20,30,40]% noise. Tables 2 to 4 indicate that σ'^(H) is better correlated with
base classifier test error than other measures, particularly for 20/80 datasets. Also σ'^(H)
was the only measure that was significantly correlated with base classifier test error
for all datasets, 50/50 and 20/80 (95% level in comparison with random chance).



Fig. 3. Test error, M and σ'^(H) for Iris 50/50 with [0,10,20,30,40]% noise. (Panels: Mean Base
Test (2 class); Hamming Decoding Test. Horizontal axis: Number of Epochs, with ticks at
1, 2, 3, 4, 7, 10, 17, 27, 43, 69.)



Figure 4 shows the mean difference, over all 20/80 datasets, between Hamming
and various other decoding schemes as the number of classifiers (number of columns of
code matrix Z) is increased. The L¹ norm decoding and the four weighted decodings are
defined in equ. (10) and (12)-(15), and α^(ada) is chosen according to the AdaBoost
logarithmic function of training error. Both weightings based on the spectral measure
(equ. (12) and (13)) give dramatic improvement for 1-4 epochs, which is almost
matched by α^(ada) but not by the weightings in equ. (14) and (15). As base classifiers
become more unbalanced at small number of epochs, weighted decoding appears to be
more effective. L¹ is better than Hamming decoding, although there is little difference
when the number of classifiers increases above 40.

Other datasets tested were Dermatology, Waveform and Soybean, none of which
showed a good correlation between σ'^(H) and test error. These three datasets have
discrete features, in contrast to the other six datasets which all have continuous features.
It is believed that the effect of discrete features on the probability distribution in the
overlap region may account for this result, but further work is required to corroborate
the observation.



5 Conclusion 

The proposed measure, calculated over the training set, is capable of detecting over- 
fitting of base classifier test error, and can be used to design a weighted vote com- 
biner. Although the measure is only applicable to two-class problems, experimental 
results in this paper demonstrate that it may also be applied to the artificial two-class
decompositions induced by Output Coding. 










[Figure 4 panels; x-axis: Number of Epochs]

Fig. 4. Test Error minus Hamming Decoding Error over 20/80 datasets for various decoding 
schemes with [10 20 40 80 160] classifiers. 



Table 1. Mean correlation coefficient (×100), over 50/50 and 20/80 datasets + 20% noise, of test
error with respect to [1-69] epochs over [2,4,8,16] nodes.





MEAS          BASE CLASSIFIER     HAMMING DECODING    L1 NORM DECODING
              50/50    20/80      50/50    20/80      50/50    20/80
-Q            75       25         68       33         71       36
-P            88       58         79       62         82       64
-A            82       76         65       56         64       51
F             87       84         72       67         71       63
-Q'           75       68         57       48         57       44
-P'           73       67         55       46         55       43
-A'           80       75         64       56         63       52
F'            90       86         75       70         74       66
-a'           82       71         68       61         68       57
[illegible]   91       84         76       68         76       63
-M            73       69         57       49         56       45

Table 2. Correlation coefficient (×100) for 50/50 datasets, of base classifier test error with
respect to [2-20] epochs at 8 nodes averaged over [0,10,20,30,40] % noise.



Dataset   -Q    -A    F     -Q'   -A'   F'    -a'   [illegible]   -M
Segment   97    96    97    97    95    99    98    97            93
Iris      82    74    73    50    71    81    97    88            64
Ecoli     75    75    78    67    69    83    85    86            62
Yeast     95    90    92    86    87    93    98    96            79
Vehicle   67    76    86    75    77    88    89    93            67
Vowel     72    90    97    87    90    98    22    98            83



Table 3. Correlation coefficient (×100), for 20/80 datasets, of base classifier test error with
respect to [2-20] epochs at 8 nodes averaged over [0,10,20,30,40] % noise.



Dataset   -Q    -A    F     -Q'   -A'   F'    -a'   -o'"   -M
Segment   68    91    90    89    88    94    96    98     84
Iris      39    31    33    12    30    39    60    70     22
Ecoli     26    13    19    2     8     29    58    75     -3
Yeast     86    63    65    51    57    70    95    93     44
Vehicle   -1    52    72    45    54    74    56    78     42
Vowel     -32   80    94    70    81    95    40    91     74



Table 4. Mean correlation coefficient (×100), over all 50/50 and 20/80 datasets, of test error
with respect to [2-20] epochs at 8 nodes averaged over [0,10,20,30,40] % noise.





MEAS          BASE CLASSIFIER     HAMMING DECODING    L1 NORM DECODING
              50/50    20/80      50/50    20/80      50/50    20/80
-Q            75       31         68       41         71       42
-P            88       56         79       59         82       61
-A            82       55         65       31         64       33
F             87       62         72       39         71       41
-Q'           75       45         57       23         57       24
-P'           73       42         55       21         55       22
-A'           80       53         64       29         63       31
F'            90       67         75       44         74       45
-a'           82       68         68       65         68       66
[illegible]   91       84         76       75         76       75
-M            73       44         57       21         56       23

Abstract. We seek to address the issue of multiple classifier formation
within Luttrell’s stochastic vector quantisation (SVQ) methodology. In
particular, since (single layer) SVQs minimise a Euclidean distance cost
function they tend to act as very faithful encoders of the input: however,
for sparse data, or data with a large noise component, merely faithful
encoding can give rise to a classifier with poor generalising abilities. We
therefore seek to assess how the SVQs’ ability to spontaneously factorise
into independent classifiers relates to overall classification performance.
In doing so, we shall propose a statistic to directly measure the aggregate
‘factoriality’ of code vector posterior probability distributions,
which, we anticipate, will form the basis of a robust strategy for determining
the capabilities of stochastic vector quantisers to act as unified
classification/classifier-combination schemes.



1 Introduction 

1.1 Stochastic Vector Quantisation 

Stochastic vector quantisation (SVQ) [eg 1-3] can be considered to serve to 
bridge the conceptual gap between topographic feature mapping [5] and standard 
vector-quantisation [6]. It does so by utilising a ‘folded’ Markov chain topology 
to statistically relate input and output vectors conceived as occupying the same 
vector space via the minimisation of a positional reconstruction error measure. 
That is, for the input vector x, we seek to minimise the aggregate Euclidean 
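distance:

$$D = \int dx\, \Pr(x) \sum_{y_1=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \int dx'\, \Pr(x'|y)\, \|x - x'\|^2 \qquad (1)$$

where x' is the output vector, and $y = (y_1, y_2, \ldots, y_n) : 1 \le y_i \le M$ the code-index
vector encoding of x.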
Although soluble non-parametrically for certain theoretical cases, in practice
a number of constraints are required to achieve this minimisation [3], the most
significant of which for the present purposes being the limitation of Pr(y|x) to
a sigmoid form:

$$\Pr(y|x) = \frac{Q(y|x)}{\sum_{y'=1}^{M} Q(y'|x)} \qquad (2)$$

where

$$Q(y|x) = \frac{1}{1 + \exp(-w(y) \cdot x - b(y))} \qquad (3)$$


Here b represents a bias offset in the input space and w a weight vector with
behaviour characteristics familiar from the study of artificial neural networks
(although the normalisation factor can considerably modify the intrinsic sigmoid
morphology). It is possible, within this framework, to concatenate chains of
SVQs together to form a multistage network with the previous code-vector space
becoming the input space of the subsequent SVQ, in which case the objective
function defaults to a weighted sum of the reconstruction errors of each stage in
the chain. It is also the case that the topographic aspects inherent in equation
1 can be explicitly pre-specified through the incorporation of a ‘leakage current’
term, $\sum_{y'} P(y|y')\, p(y'|x)$, in equation 2.
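
As an illustration of the sigmoid-constrained posterior of equations 2 and 3, the following is a minimal sketch only, not the authors' implementation. The function name `svq_posterior`, the array layout (one weight row per code index) and the use of NumPy are all assumptions introduced for this example; the leakage term is omitted.

```python
import numpy as np

def svq_posterior(x, W, b):
    """Posterior Pr(y|x) over M code indices (hypothetical helper).

    x : input vector, shape (d,)
    W : weight vectors, one row per code index, shape (M, d)  (assumed layout)
    b : bias offsets, shape (M,)
    """
    # Unnormalised sigmoid response of each code index, as in equation 3.
    q = 1.0 / (1.0 + np.exp(-(W @ x) - b))
    # Normalise over the code indices, as in equation 2.
    return q / q.sum()

# Example with arbitrary (illustrative) parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
p = svq_posterior(np.array([0.5, -1.0, 2.0]), W, b)
```

With all weights and biases set to zero every code index responds identically, so the posterior is uniform over the M indices; this is a convenient sanity check on the normalisation.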

The method by which we shall attempt to attain the global minimum of the 
quantity D for both single and multistage SVQs proceeds, in the current paper, 
via a minor adaptation of Luttrell’s original method [3], whereby a pseudo- 
annealing regime is employed within which step-size updates between gradient- 
descent iterations are adjusted logarithmically until convergence is attained. In 
this way we mimic the ideal logarithmic temperature reduction of simulated 
annealing, albeit without a thermalisation step [cf eg 7]. When we later come 
to apply additional constraints on the convergence, it will be assumed that the 
underlying convergence mechanism is still that of a logarithmic gradient descent 
minimisation of D. 

An important consequence arises from this uniquely probabilistic form of 
vector quantisation: the act of maximising the error resilience of the necessarily 
bandwidth-limited transmission of stochastically sampled pattern-vector code 
information over the folded Markov chain structure gives rise to a completely 
natural mechanism for imposing an appropriately-dimensioned topology on the 
training vectors. In particular, it becomes possible to perform a factorial decom- 
position of the input when strong factor independence is indicated by the data 
structure (and the SVQ ‘bandwidth’ is not significantly in excess of that required 
to represent the factorised code-vector distribution). That is, the SVQ methodology
is able to serve as a factor analyser as required, without any a priori imposition
at the conceptual level, the topological constraints on the SVQ arising solely
as a consequence of the data morphology and the requirement of representing it
in the most efficient and noise-robust manner possible (efficient representation,
in effect, dictating factorial-decomposition, and the error-robustness constraint
imposing a proximity-based topology on the code-vector priors). In the absence
of well defined independent factors in the data (or with an excess of parametric
freedom in the SVQ), the proximetric aspect of the code-transmission error estimation
robustness tends to dominate, and the SVQ code-vectors act to faithfully
encode only cluster proximities within the training data.

1.2 Objective of the Current Investigation 

Beyond being an indicator of SVQ representative efficiency, the concept of fac- 
toriality is, we expect, also significant as an indicator of potential classification 
performance. Thus taking a factorial perspective on the mechanics of SVQ con- 
vergence significantly differs from (but is in no way exclusive of) the character- 
isation of the SVQ state solely via the objective function, which describes the 
closeness of fit to the training vectors, but does not, however, in itself give an 
indication of the extent to which the SVQ has captured all of the morphological 
aspects of the classification problem under consideration. 

We should, in the following paper, therefore like to test the intuition that 
the maximally factorial behaviour in relation to the SVQ information band- 
width parameters (sampling frequency and code vector cardinality) coincides, 
in general, with the peak of classification performance on typical test data-sets, 
and consequently with the actual underlying pattern-vector probability density 
distribution in so far as it is representable by the SVQ. 

Our motivation for this stems from a general observation (experimentally for- 
malised in section 3), that when the number of code indices is increased linearly, 
SVQs undergo a behavioural transition as the mechanism firstly characterises 
individual localities of the pattern space without co-ordinatising any particular 
sub-manifold, before making a transition to the factorial encoding regime at a 
point at which there is sufficient parametric freedom available to the SVQ for it 
to efficiently describe the predominating sub-manifold of the training data. As 
further code indices are made available to the SVQ, the method once again tends 
toward joint encoding, this behaviour apparently correlating with a reduction in 
the generalising ability of the classifier as the SVQ starts to characterise localised 
peculiarities of the training data that are not shared by the test set. The criterion 
of ‘maximum factoriality’ might therefore be supposed to serve to determine the 
appropriate amount of informational bandwidth required to efficiently charac- 
terise the underlying class PDFs from which the pattern data are drawn, beyond 
which the effect of over-classification begins to predominate, as too much para- 
metric freedom is allocated in relation to the training set, and below which the 
classification suffers from insufficient data description bandwidth. 

It is therefore this hypothesis, namely that maximal factoriality with respect 
to the various information bandwidth parameters corresponds to the most effi- 
cient descriptor of the pattern space, corresponding in turn to the most effectively 
generalising classifier, that we shall seek to test in the following paper. 

1.3 Paper Structure 

In addressing, then, the issues surrounding SVQs as general classifiers, our first 
specific objective, after having derived an appropriate measure of SVQ factori- 
ality in section 2, will be to provide experimental evidence of the assertion that 
factoriality correlates with classifier generalising ability, firstly, by demonstrating
the correlation between SVQ ‘information bandwidth’ and factorisation of the
code vectors with respect to simulated data, and secondly, via a demonstration
that the maximum of SVQ factoriality corresponds to the optimal classification
performance (constituting sections 3 and 4, respectively). Following this, we shall
turn in section 5 to a brief consideration of how maximised factoriality might be
accomplished within the existing framework of Euclidean error objective function
minimisation, findings suggesting that an appropriate upper-layer weighting
will encourage factorial encoding of the initial layer as well as effectively acting
as a combination scheme for the factorised classifiers.

2 Aggregate Weight Collinearity as a Factoriality Measure

The most appropriate statistic of classifier performance has thus been hypoth- 
esised to be one that gives a measure of the degree to which SVQs factorise 
in relation to a particular training set. Any such realisation of the proposed 
statistic would, necessarily, have to be robustly independent of both the bias 
factors and the feature dimensionality, as well as being simple to compute. Of 
the strategies to achieve this that most readily suggest themselves, two broad 
types may be distinguished within the terms of our analysis: the ‘morphological’ 
(ie concerning the input-space of the folded Markov chain) and the ‘topologi- 
cal’ (concerned with the interconnections of the code indices at the higher levels 
of the sequence of folded Markov chains). The former thus seeks to determine 
factoriality by measuring certain aggregate parameters of the code index prob- 
ability distributions in the input space, whilst the latter determines the pattern 
of interconnectivity of the code indices themselves. However, topological mea- 
sures, while being arguably the truest measure of ‘factoriality’ (given that the 
term is not absolutely defined), have the disadvantage of not directly reflecting 
underlying metrical disposition, as well as, in some cases, requiring a global optimisation
algorithm to enumerate (for instance, a block-diagonal factorisation).
Morphological measures, on the other hand, are capable of implicitly capturing
the topological aspect of factoriality, while more directly capturing its metrical
aspects. Furthermore, morphology measures can naturally encompass the form
of generalisation that is required to occur in response to additive noise, in particular,
the elongations of the code-vector kernels along the noisy dimensions (see
[eg 3] for an illustration of this behaviour).

Favouring, then, the morphological approach to factoriality quantification, 
the most straightforward strategy is to take a density spectrum of the orienta- 
tions of the individual weights, which exhibits a very clear bimodality in relation 
to the two distinct behavioural classes of factorial and joint encoding open to 
the SVQ. We define this quantity formally as follows: 

2.1 Derivation of Factoriality Measure 

We denote a specific weight-vector by $w_{ij}$, where the first index, i, denotes a
particular ordinate of the input space and the second index, j, denotes the code
index with which the weight is associated. The hyper-length-normalised weight
vector is therefore given as:

$$w'_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{d} w_{kj}^2}} \qquad (4)$$

Only this latter quantity shall be considered throughout the following; a consequence
of our being primarily interested in the aggregate weight orientations.
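
The length normalisation described above can be sketched as follows. This is an illustrative example under stated assumptions: the function name `normalise_weights` and the (d, M) array layout (ordinate index first, code index second, mirroring the index convention in the text) are introduced here and are not taken from the paper, and every weight vector is assumed to be non-zero.

```python
import numpy as np

def normalise_weights(W):
    """Scale each code index's weight vector to unit Euclidean length.

    W : array of shape (d, M); W[i, j] is ordinate i of the weight vector
        associated with code index j (assumed layout).
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)  # one length per code index
    return W / norms

# Two code indices in a two-dimensional input space (illustrative values).
W = np.array([[3.0, 0.0],
              [4.0, 2.0]])
W_unit = normalise_weights(W)
```

After normalisation only the orientation of each weight vector remains, which is the quantity the factoriality measure is built on.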

The weight orientations are thus distributed around a (d - 1)-dimensional
angular hyper-space, orientation specification requiring one fewer free parame- 
ter than the specification of hyper-position. We wish to derive an appropriate 
model for this distribution in terms of the d weight ordinates with which we are 
presented by the SVQ. The first task is therefore to determine a density-bin size 
in the ordinate space appropriate to angularly distributed vectors. 

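We have from [4] that points uniformly distributed on the hypersphere $S^{N-1}$,
normalised to unity, embedded within $\mathbb{R}^N$ and then projected into $\mathbb{R}^M$ ($M < N$),
have radial probability distributions $P_{N,M}(r)$ given by:

$$P_{N,M}(r) = C_{N,M}\, r^{M-1} (1 - r^2)^{(N-M-2)/2} \qquad (5)$$

where:

$$C_{N,M} = \frac{2\,\Gamma(N/2)}{\Gamma(M/2)\,\Gamma((N-M)/2)} \qquad (6)$$

So, for example, for N = 3, M = 2 we obtain:

$$P_{3,2}(r) = \frac{r}{(1 - r^2)^{1/2}} \qquad (7)$$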

Consequently, the appropriate bin size for our unidimensional (M = 1) sample
of the d-dimensionally-embedded hypersphere is given, for small $\Delta_1$, by:

$$\Delta_1(r) = \frac{\Delta x}{P_{d,1}(r)} = \Delta x\, \frac{(1 - r^2)^{(3-d)/2}\, \pi^{1/2}\, \Gamma\!\left(\frac{d-1}{2}\right)}{2\,\Gamma\!\left(\frac{d}{2}\right)} \qquad (8)$$

with $\Delta x$ some constant such that $\Delta x < 1$ (smaller values of $\Delta x$ giving greater accuracy
at the expense of a lower linear sampling-rate). Thus, for uniformly distributed
angular vectors, the 1-D projection with bin-size $\Delta x$ gives rise to a uniform
distribution.
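
The radial density of equation 5 and the density-compensated bin size for the M = 1 projection can be checked numerically. The sketch below is illustrative only: the function names `radial_density` and `bin_size` are hypothetical, the normalising constant is the standard one for projections of a uniform distribution on the unit hypersphere, and r is assumed to lie in [0, 1).

```python
import math

def radial_density(r, N, M):
    """Radial density of uniform points on the unit sphere in N dimensions,
    projected into M < N dimensions (equation 5, with its normalising constant)."""
    c = 2.0 * math.gamma(N / 2) / (math.gamma(M / 2) * math.gamma((N - M) / 2))
    return c * r ** (M - 1) * (1.0 - r * r) ** ((N - M - 2) / 2)

def bin_size(r, d, dx):
    """Bin width in the projected ordinate that holds a constant probability
    mass dx under the M = 1 radial density (hypothetical helper)."""
    return dx / radial_density(r, d, 1)

# Midpoint-rule check that the N = 5, M = 1 density integrates to one on [0, 1].
steps = 1000
total = sum(radial_density((k + 0.5) / steps, 5, 1) for k in range(steps)) / steps
```

For N = 3, M = 1 the density is constant, so the bins are uniform; for larger d the bins widen towards r = 1, where the projected density falls away.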

Beyond this, we also require a coordinate transformation T between the 
density-normalised unidimensional space, x, and the radial-projection space, r. 
This is straight-forwardly given by the integral of the above as $\Delta x \to 0$:

2T^(d-l)/2 I p (l 1 ^. 3 . 2 ')! 

T{x) = ^ (9) 

with ${}_2F_1$ the inductively-derived generalised hypergeometric function.

Fig. 1. [Plot: deviation probability vs encoder type]



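The regularised density spectrum for the normalised weight-orientations $w'_{ij}$ with respect to the
feature-ordinate $X_i$ sampled at intervals $\Delta x$ is thus:

$$D_i(X_i) = \sum_{j=1}^{M} \int_{x=X_i}^{x=X_i+\Delta x} \delta\!\left(x - T^{-1}(w'_{ij})\right) dx \qquad (10)$$

$$\phantom{D_i(X_i)} = \sum_{j=1}^{M} \int_{r=T^{-1}(X_i)}^{r=T^{-1}(X_i)+\Delta x/P_{d,1}(r)} \delta\!\left(r - w'_{ij}\right) dr \qquad (11)$$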

We illustrate the bimodal behaviour of this distribution with respect to joint 
and factorial encoding in figure 1 on our real-world data-set, selected to be 
broadly representative of the sorts of data-set commonly brought before the 
pattern recognition community for automated classification. It consists in a set 
of expertly-segmented geological survey images, with features determined via 
a battery of cell-based processes for texture characterisation, chosen without 
regard to the particular nature of the classification problem. Hence the data 
exhibits an approximate manifold structure of considerably smaller intrinsic di- 
mensionality than the total feature set cardinality. Of the original 26 features, a 
subset of three are consequently chosen to exemplify this manifold structure for 
the testing of the SVQ methodology. In particular, the data is sufficiently sparse 
for factorial and joint encoding to produce significantly different code vector 
probability distributions. In practical terms, to utilise the regularised collinearity
distribution $D_i(X_i)$ as a factoriality measure, it is necessary to contract the
information to a single value by taking the mean quantity:
E



M 

~d 



{pjn ) : Pin) > J^} 



(12) 



which would correspond to manifold dimensionality in the case where equal 
numbers of code vectors are allocated per ordinate, and the manifold achieves 
optimal factorisation. 

In general, maximal factoriality requires that we minimise this quantity with 
respect to a fixed Euclidean error (Euclidean error solutions being degenerate 
with respect to factoriality). 



3 Factoriality as a Function of SVQ Bandwidth Parameterisation

Having obtained an appropriate factoriality statistic, it is necessary to establish 
its relation to the various bandwidth parameters of the SVQ (which, when con- 
catenated, would correspond loosely to the ‘information content’ of the resulting 
pattern space description). The two critical measures of this SVQ ‘bandwidth’ 
are the number of code indices, M, and the number of stochastic samples, n, 
employed in the derivation of Pr(y|x), giving, respectively, the number of free
morphological descriptors of the space, and an indication of their ‘resolution’. A 
plot of these parameters against the factoriality measure for a canonical 2-toroid 
manifold results in a graph of the type shown in figure 2a, which clearly demon- 
strates a maximum at 8 code indices that is free of dependency on the number 
of samples, provided that n > 10 (the method generally factorialising maximally 
at a lower figure when the codes are badly under-sampled). A graph of the full 
spectrum of weight orientation densities is shown in figure 2b, plotted against 
the code index parameter that we have thus established to be the predominating 
information constraint (on the proviso that a sufficient number of samples have 
been allocated). This result, critically, correlates with the analytic findings of 
[1] for the same data set, demonstrating that factoriality is the characteristic 
quantity defining the effectiveness with which the SVQ code disposition is suc- 
cessful in characterising the intrinsic qualities of the data. Identical results (not 
reported in this paper) are obtained for comparable empirical data. 



4 Investigation of Relationship between Factoriality 
and Classification Performance 

It is now possible to address the central issue of classification performance, and 
instigate an investigation via the two-stage supervised procedure. We shall there- 
fore test our assertion of the relation of the factoriality statistic to classification 
performance with reference to the simulated 2-toroid data-set. In particular, we 
shall specify a two-class case consisting in a pair of displaced 2-toroids separated 
by one radial unit across their major axis, performing cross-validation via an 
equivalently-sized set of test and training vectors. 

[Fig. 2 panels: (a) SVQ factoriality for 2-toroidal data set; (b) Metric probability vs number of codes (relative probability density).]

[Fig. 3 panels: (a) Two-stage SVQ lower-layer factoriality for two displaced 2-tori; (b) Two-stage SVQ classification accuracy for two displaced 2-tori.]


Results for this approach are set out in figures 3a and 3b, which collectively 
demonstrate an extremely strong (although non-linear) correspondence between 
the measures of classification performance and (lower-layer) factoriality, as mea- 
sured against the two information-bandwidth parameters. 

5 Non-parametric Factoriality: 

The Prospects for Employing Multi-layer SVQs as 
Generalised Bayesian Classifiers/Classifier-Combiners 

We have thus far modified SVQ behaviour by adjusting the bandwidth param- 
eter to match that of the underlying pattern-data’s manifold information con- 
tent. We should ultimately like to address the problem of maximising factoriality 
from the perspective of fixed parameter SVQs. How, then, might factoriality be 
induced without a priori tuning? The most straight-forward solution is to in- 
troduce an additional layer: we have already seen how a two-stage system can 
act as a supervised classification system, with the upper layer determining class 
allocations; however, it is also the case that the presence of an upper layer will 
act to encourage factoriality (which is to say, spontaneous classifier subdivision) 
by virtue of decreasing the overall hyper-volume of the code-vector probability
distribution. Although beyond the scope of this paper to elucidate fully, we
should consequently like to set the stage for addressing the question of how the 
connection strengths both within and between layers in a multi-layer SVQ sys- 
tem could be made to respond to factorial analysis in a manner appropriate to 
maximising overall classification performance. That is to say, we would specif- 
ically like to assess how the second (or higher) SVQ layers could act as both 
classifier diversifiers (via the connection strength parameter), as well as classi- 
fier combiners (via the specific code vector topologies). In this way, it should be 
possible to re-envisage SVQs as a method for unifying feature-selection, classifier 
design and classifier combination within a single framework, and hence, by using 
our factoriality measure as a diagnostic tool, gaining an understanding of the 
appropriate balance between these component subsystems free of the a priori 
structural constraints inherent in other multi-classifier systems.

6 Conclusion 

We have set out to give an indication of how the classification abilities of the 
stochastic vector quantisation methodology might be enhanced in terms of its 
capacity to factorise. Doing so involved, firstly, the construction of a robust 
and economic aggregate measurement of factoriality in terms of the net weight 
collinearity, and secondly, an investigation across the range of SVQ bandwidth 
parameters of the relationship between the degree of factoriality and the Bayesian 
classification performance with respect to simulated data sets. We thereby con- 
firmed both the presence of an intrinsic ‘manifold information content’ within 
the data (via the singular peak of the factoriality statistic) with respect to the 
single-layer SVQ bandwidth parameters, and also the correspondence of this 
peak with the maximal generalising ability of the classifier. 

We then went on to briefly consider a bandwidth-independent multilayer SVQ 
implementation, wherein the level of factoriality is achieved with respect to the 
connection strength parameter of a second (or higher) SVQ stage. In doing so, 
we were able to begin to re-envisage SVQs as a mechanism capable of sponta- 
neously evolving towards a combined classifier framework, it being argued that 
SVQs thus represent a uniquely natural test-bed for studying expert fusion and 
diversity issues. In particular, now that a suitable diagnostic measure has been 
derived with the potential to determine the interaction between classifier diver- 
sification and combination in situ, we are in a position to give substance to the 
notion that SVQs are sufficiently parametrically-free classifiers as to effectively 
be considered constrained only by training data morphology. 

Abstract. Error Correcting Output Coding is a well established technique to
decompose a multi-class classification problem into a set of two-class prob- 
lems. However, a point not yet considered in the research is how to apply this 
method to a cost-sensitive classification that represents a significant aspect in 
many real problems. In this paper we propose a novel method for building 
cost-sensitive ECOC multi-class classifiers. Starting from the cost matrix for 
the multi-class problem and from the code matrix employed, a cost matrix is ex- 
tracted for each of the binary subproblems induced by the coding matrix. As a 
consequence, it is possible to tune the single two-class classifier according to 
the cost matrix obtained and achieve an output from all the dichotomizers 
which takes into account the requirements of the original multi-class cost ma- 
trix. To evaluate the effectiveness of the method, a large number of tests has 
been performed on real data sets. The first experimental results show that the 
proposed approach is suitable for future developments in cost-sensitive applica- 
tion. 



1 Introduction 

A widespread technique for tackling a classification problem with many possible classes is to
decompose the original problem into a set of two-class problems. The rationale of this
approach relies on the stronger theoretical roots and better comprehension characterizing
two-class classifiers (dichotomizers) such as Perceptrons or Support Vector
Machines. Moreover, with this method it becomes possible to employ in multi-class
problems some dichotomizers which are very effective in two-class problems but are
not capable of directly performing multi-class classification.

In this framework, Error Correcting Output Coding (ECOC) has emerged as a well
established technique for many applications in the field of Pattern Recognition and
Data Mining, mainly for its good generalization capabilities. In short, ECOC decomposition
labels each class with a bit string (codeword) of length L, higher than the
number of classes. The codewords are arranged as rows of a coding matrix, each of whose
columns defines a two-class problem; thus, for each problem, the set of the original
classes is partitioned into two complementary superclasses. On the problems induced by
the coding matrix, L dichotomizers have to be trained in the learning phase. In the
operating phase, the dichotomizers will provide a string of L outputs for each sample
to be classified. The Hamming distance of this string from each of the codewords of
the coding matrix is then evaluated and the class that corresponds to the nearest
codeword is chosen. Usually, the codewords are chosen so as to have a high Hamming
distance between each other; in this way, ECOC is robust to potential errors
made by the dichotomizers.

The reasons for the classification efficiency exhibited by ECOC seem to be the re- 
duction of both bias and variance [1] and the achievement of a large margin [2,3]. 
After the seminal paper by Dietterich and Bakiri [4], many studies have been pro- 
posed which have analyzed several aspects of ECOC such as the factors affecting the 
effectiveness of ECOC classifiers [5], techniques for designing codes from data [6], 
evaluations of coding and decoding strategies [3,7]. 

However, a point not yet considered is how to devise an ECOC system for
cost-sensitive classification. This is a significant point in many real problems such as
automated disease diagnosis, currency recognition, speaker identification, and fraud
detection, in which different classification errors frequently have consequences of
very different significance. For example, in systems for automatic diagnosis,
classifying a healthy patient as sick is much less critical than classifying a sick patient
as healthy or misrecognizing the disease, because the first error can subsequently be
corrected at the cost of a further analysis, while there may be no chance to recover
from the second error. This is also true for correct classifications: the benefit obtained
when recognizing the correct disease for a sick patient is much higher than that of
identifying a healthy patient. For this reason, the classification systems used in such
situations must take into account the different costs and benefits (collected in a cost
matrix) which the different decisions can provide and thus should be tuned accordingly.
In multi-class classifiers, this task is usually accomplished by modifying the
learning algorithm used during the training phase of the classifier or by tuning the
classifier after the learning phase.

In this paper we propose a method for building cost-sensitive ECOC multi-class 
classifiers. Starting from the cost matrix for the multi-class problem and from the 
code matrix employed, a cost matrix is derived for each of the binary problems in- 
duced by the columns of the code matrix. In this way it is possible to tune the single 
dichotomizer according to the cost matrix obtained and achieve an output from the 
dichotomizers which takes into account the requirements of the original multi-class 
cost matrix. 

In the rest of the paper we present, after a short description of the ECOC approach,
some issues about two-class cost-sensitive classification; then we describe how to
decompose the original multi-class cost matrix into several two-class cost matrices. A
concluding section describes the results obtained from experiments performed on real
data sets.

2 The ECOC Approach 

Error Correcting Output Coding was introduced to decompose a multi-class
problem into a set of complementary binary problems. Each class label is represented
by a bit string of length L, called a codeword, with the only requirement that distinct
classes are represented by distinct bit strings. If n is the number of the original
classes, a code is an $n \times L$ matrix $M = \{M_{ij}\}$ where $M_{ij} \in \{-1,+1\}$. Each row of M
corresponds to a codeword for a class, while each column corresponds to a binary problem.
In this way, the multi-class problem is reduced to L binary problems on which L
dichotomizers have to be trained. An example of a coding matrix with n = 5 and L = 7
is shown in Table 1.



Table 1. A coding matrix for 5 classes and 7 dichotomizers 



                    dichotomizers
class     1     2     3     4     5     6     7
A        +1    -1    +1    -1    +1    +1    -1
B        -1    +1    +1    -1    +1    -1    -1
C        +1    -1    +1    +1    -1    +1    -1
D        -1    +1    +1    -1    +1    +1    -1
E        -1    -1    -1    +1    +1    -1    +1


However, each dichotomizer is learned from a finite set and thus, when classifying
a new sample, its prediction could be wrong. This does not necessarily lead to an
irrecoverable error in the multi-class problem since the code matrix is built by n
distinct binary strings of length L > n, so as to make the Hamming distance between
every pair of strings as large as possible. In fact, the minimum Hamming distance d
between any pair of codewords is a measure of the quality of the code, because the
code is able to correct at least $\lfloor (d-1)/2 \rfloor$ single bit errors. In this way,
when d ≥ 3 a single bit error does not influence the result, whereas it can do so when
using the usual one-per-class coding, where the Hamming distance between each pair
of strings is 2.

To classify a new sample x, a vector of binary decisions is computed by applying
each of the learned dichotomizers to x; to decode the resulting vector, i.e. to pass
from the binary to the multi-class problem, the most common approach consists in
evaluating the Hamming distances between the vector and the codewords of the
matrix and choosing the nearest codeword, i.e. the one at minimum Hamming distance.
Other decoding rules have been proposed which are based on a Least Squares
approach [7] or on the loss function employed in the training algorithm of the
dichotomizer [3], but we will not consider them in this paper.
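
The minimum Hamming distance decoding described above can be sketched as follows. This is only an illustrative sketch: the names `CODE_MATRIX`, `hamming` and `decode` are ours, the codewords are those of table 1, and ties between equally distant codewords are resolved arbitrarily (in favour of the class listed first), a detail the text does not specify.

```python
# Coding matrix of table 1: one codeword (row) per class, one entry per dichotomizer.
CODE_MATRIX = {
    "A": (+1, -1, +1, -1, +1, +1, -1),
    "B": (-1, +1, +1, -1, +1, -1, -1),
    "C": (+1, -1, +1, +1, -1, +1, -1),
    "D": (-1, +1, +1, -1, +1, +1, -1),
    "E": (-1, -1, -1, +1, +1, -1, +1),
}


def hamming(u, v):
    """Number of positions in which two equal-length codewords differ."""
    return sum(a != b for a, b in zip(u, v))


def decode(outputs, code_matrix=CODE_MATRIX):
    """Return the class whose codeword is nearest to the vector of binary decisions.

    Ties are resolved in favour of the class listed first (an arbitrary choice).
    """
    return min(code_matrix, key=lambda label: hamming(outputs, code_matrix[label]))


# The codeword of class E with the decision of the 2nd dichotomizer flipped.
noisy = (-1, +1, -1, +1, +1, -1, +1)
print(decode(noisy))
```

In this example the corrupted vector is still at distance 1 from the codeword of E and farther from all the others, so the single bit error is recovered.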



3 Cost-Sensitive Dichotomizers

Before analyzing cost-sensitive classification in the multi-class case, it is convenient
to focus preliminarily on the two-class problem.

In this case, a sample can be assigned to one of two mutually exclusive classes that
can be generically called Positive (P) class and Negative (N) class. The set of samples
classified as “positive” by the dichotomizer will contain some actually-positive
samples correctly classified and some actually-negative samples incorrectly classified.
Hence, two appropriate performance figures are given by the True Positive Rate
(TPR), i.e. the fraction of actually-positive cases correctly classified, and by the False
Positive Rate (FPR), given by the fraction of actually-negative cases incorrectly
classified as “positive”. In a similar way, it is possible to evaluate the True Negative Rate
(TNR) and the False Negative Rate (FNR). It is worth noting that only two indices are
actually necessary because the following relations hold:

$$FNR = 1 - TPR, \qquad TNR = 1 - FPR \qquad (1)$$

In cost-sensitive applications, every decision taken by the classifier involves a cost
which estimates the penalty produced by an error or the benefit produced by a correct
decision. In many applications the two kinds of error (false positive and false
negative) are not equally costly, and the benefit obtained can depend on the class of
the sample correctly identified. Hence, we have to consider a cost matrix similar to
the one described in table 2. It is worth noting that, while $C_{FN}$ and $C_{FP}$ have
positive values, $C_{TP}$ and $C_{TN}$ are negative costs since they actually represent a
benefit.



Table 2. Cost matrix for a two class problem 



                        True class
                       N         P
Predicted     N      C_TN      C_FN
class         P      C_FP      C_TP



With such assumptions, an estimate of the effectiveness of a dichotomizer working
in a cost-sensitive application can be given by the expected classification cost (EC),
defined as:

$$EC = p(P) \cdot C_{FN} \cdot FNR + p(N) \cdot C_{FP} \cdot FPR + p(P) \cdot C_{TP} \cdot TPR + p(N) \cdot C_{TN} \cdot TNR \qquad (2)$$

where p(P) and p(N) are the a priori probabilities of the positive and negative classes.

Eq. (2) shows the most general formulation for the expected cost. It can be
simplified taking into account that the optimal decision is unchanged if each entry of
the cost matrix is multiplied by a positive constant and/or a constant is added to
each entry [8]. Hence, subtracting $C_{TN}$ from every entry and dividing by
$C_{FP} - C_{TN}$, an equivalent form for the cost matrix in table 2 is given in table 3:



Table 3. A cost matrix equivalent to the cost matrix in tab. 2 



                        True class
                     N         P
Predicted     N      0      (C_FN - C_TN) / (C_FP - C_TN)
class         P      1      (C_TP - C_TN) / (C_FP - C_TN)

The corresponding expression for the expected cost is:



$$EC = p(P) \cdot \frac{C_{FN} - C_{TN}}{C_{FP} - C_{TN}} \cdot FNR + p(N) \cdot FPR + p(P) \cdot \frac{C_{TP} - C_{TN}}{C_{FP} - C_{TN}} \cdot TPR \qquad (3)$$



Actually, since FNR and TPR depend on each other, the expression can be further 
simplified: 



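$$\begin{aligned}
EC &= p(P) \cdot \frac{C_{FN} - C_{TN}}{C_{FP} - C_{TN}} \cdot FNR + p(N) \cdot FPR + p(P) \cdot \frac{C_{TP} - C_{TN}}{C_{FP} - C_{TN}} \cdot (1 - FNR) \\
   &= p(P) \cdot \frac{C_{FN} - C_{TP}}{C_{FP} - C_{TN}} \cdot FNR + p(N) \cdot FPR + p(P) \cdot \frac{C_{TP} - C_{TN}}{C_{FP} - C_{TN}}
\end{aligned} \qquad (4)$$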



For classification purposes, the last term can be neglected since it represents
only an offset which does not affect the choice of the decision rule minimizing the
expected cost. Therefore the corresponding cost matrix becomes:



Table 4. Cost matrix for a two class problem 



                        True class
                     N         P
Predicted     N      0      (C_FN - C_TP) / (C_FP - C_TN)
class         P      1      0



Consequently, the expression of the expected cost associated with the cost matrix
shown in table 4 is:

$$EC = p(N) \cdot FPR + p(P) \cdot \rho \cdot FNR \qquad (5)$$

We can thus realize that, in two-class problems, the cost matrix has actually only
one degree of freedom, given by the ratio $\rho = \frac{C_{FN} - C_{TP}}{C_{FP} - C_{TN}}$.
As a consequence, all problems having equal $\rho$ are equivalent, i.e. they have the
same optimal decision rule.

Some conditions must hold for the cost matrix to be realistic: $\rho$ must be
positive and finite. In fact, if $\rho < 0$ the minimization of the expected cost (see
eq. 5) will lead to the trivial decision rule which assigns all the samples to the
negative class. On the contrary, if $\rho \to \infty$ the optimal decision rule will tend to
classify all samples as positive. This implies that
$\mathrm{sign}(C_{FN} - C_{TP}) = \mathrm{sign}(C_{FP} - C_{TN})$,
$C_{FN} \neq C_{TP}$ and $C_{FP} \neq C_{TN}$.
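
The equivalence between the general expected cost of eq. (2) and the reduced form of eq. (5) can be checked numerically. The following is a minimal sketch: the function names and the cost values are illustrative assumptions of ours, not taken from the experiments. It relies on the fact that, substituting TPR = 1 - FNR and TNR = 1 - FPR, the cost of eq. (2) equals the cost of eq. (5) multiplied by $C_{FP} - C_{TN}$ plus a constant offset, so that both are minimized by the same decision rule whenever $C_{FP} - C_{TN} > 0$.

```python
def cost_ratio(c_fn, c_fp, c_tp, c_tn):
    """Cost ratio rho = (C_FN - C_TP) / (C_FP - C_TN)."""
    return (c_fn - c_tp) / (c_fp - c_tn)


def expected_cost_full(p_pos, fnr, fpr, c_fn, c_fp, c_tp, c_tn):
    """Expected cost of eq. (2) for a given operating point (FNR, FPR)."""
    p_neg = 1.0 - p_pos
    tpr, tnr = 1.0 - fnr, 1.0 - fpr
    return (p_pos * c_fn * fnr + p_neg * c_fp * fpr
            + p_pos * c_tp * tpr + p_neg * c_tn * tnr)


def expected_cost_reduced(p_pos, fnr, fpr, rho):
    """Expected cost of eq. (5), which depends on the costs only through rho."""
    return (1.0 - p_pos) * fpr + p_pos * rho * fnr


# Illustrative (hypothetical) costs: errors are penalties, correct decisions are benefits.
costs = dict(c_fn=8.0, c_fp=3.0, c_tp=-2.0, c_tn=-1.0)
rho = cost_ratio(**costs)

full = expected_cost_full(0.3, 0.1, 0.4, **costs)
reduced = expected_cost_reduced(0.3, 0.1, 0.4, rho)
# full == (C_FP - C_TN) * reduced + p(P) * C_TP + p(N) * C_TN
print(full, (3.0 + 1.0) * reduced + 0.3 * -2.0 + 0.7 * -1.0)
```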

4 Evaluating the Two-Class Costs from the Multi-class Costs 

Let us now consider a multi-class problem with n classes to be reduced to L binary
problems by using an n×L coding matrix $M = \{M_{hk}\}$, where $M_h$ is the codeword
for the class h and $M_{hk}$ is the label assumed by a sample belonging to the class h
in the binary problem induced by the k-th column. Moreover, let us assume that the
costs of the multi-class problem are described by an n×n cost matrix $C = \{C_{ij}\}$,
where $C_{ij} > 0$ represents the cost produced by assigning to the class j a sample
actually belonging to the class i; the cost for a correct classification is null, i.e.
$C_{ii} = 0 \ \forall i$.

For each column k, the n original classes are clustered into two classes, labelled
with -1 and +1, which can be identified with the class N and the class P introduced in
the previous section. For this reason, let us define $N^{(k)} = \{h \mid M_{hk} = -1\}$ the
set of classes labelled with -1 and $P^{(k)} = \{h \mid M_{hk} = +1\}$ the set of classes
labelled with +1 in the k-th binary problem. In an analogous way, let us call
$C_{FP}^{(k)}$ and $C_{FN}^{(k)}$ the costs produced in the operative phase by the
dichotomizer trained on the k-th problem when it erroneously assigns to $P^{(k)}$ a
sample belonging to $N^{(k)}$ and vice versa.

To establish the values of $C_{FP}^{(k)}$ and $C_{FN}^{(k)}$, let us consider which are the
consequences on the multi-class problem of an error made by the k-th dichotomizer. A
false positive error moves one unit away from the true codewords containing a -1 in
the k-th position toward the erroneous codewords containing a +1 in the same
position. In particular, if r and s are two classes such that $r \in N^{(k)}$ and
$s \in P^{(k)}$, a false positive error made by the k-th dichotomizer on a sample
belonging to r will move one unit from the correct codeword of r, $M_r$, toward the
codeword of s, $M_s$. Let us call $d(M_r, M_s)$ the Hamming distance between $M_r$ and
$M_s$; if there were errors also on the other $d(M_r, M_s) - 1$ bits in which the two
codewords differ, an error (with a cost equal to $C_{rs}$) would be generated in the
multi-class problem. The contribution to such error given by the false positive
produced by the k-th dichotomizer can be hence estimated as $1 / d(M_r, M_s)$; as a
consequence, the cost of the false positive related to the possible misclassification
between r and s can be estimated as $C_{rs} / d(M_r, M_s)$.

Fig. 1 shows an example with reference to the coding matrix given in table 1. In
particular, we are considering a false positive produced by the 4th dichotomizer when
assigning to class E a sample of class A; to this aim, let us remark that $N^{(4)}$
collects classes A, B and D, while $P^{(4)}$ contains classes C and E.

Actually, the false positive moves the codeword $M_r$ toward all the codewords
belonging to $P^{(k)}$ and thus the cost related to all the possible misclassifications
involving the class r can be estimated as (see fig. 2):

$$\sum_{s \in P^{(k)}} \frac{C_{rs}}{d(M_r, M_s)} \qquad (6)$$

Finally, we have to extend such evaluation to all the classes belonging to $N^{(k)}$.
The conclusion is that the cost for a false positive made by the k-th dichotomizer is
related to the risk of misclassifying one of the classes belonging to $N^{(k)}$ with one
from $P^{(k)}$ and an estimate of its value is:

$$C_{FP}^{(k)} = \sum_{r \in N^{(k)}} \sum_{s \in P^{(k)}} \frac{C_{rs}}{d(M_r, M_s)} \qquad (7)$$

Fig. 1. Estimating the cost of a false positive produced by the 4th dichotomizer of the
coding matrix shown in table 1 when assigning a sample of class A to class E. The cost
is evaluated as $C_{AE}/5$, where $C_{AE}$ is the cost in the multi-class cost matrix and
$d(M_A, M_E) = 5$.

Fig. 2. The false positive produced by the 4th dichotomizer on a sample of class A
makes the codeword move toward $P^{(4)}$, with a cost estimated as
$C_{AE}/5 + C_{AC}/2$.



Likewise, it is possible to estimate the cost for a false negative made by the k-th
dichotomizer since it is related to the risk of misclassifying one of the classes
belonging to $P^{(k)}$ with one from $N^{(k)}$:

$$C_{FN}^{(k)} = \sum_{s \in P^{(k)}} \sum_{r \in N^{(k)}} \frac{C_{sr}}{d(M_s, M_r)} \qquad (8)$$



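In this way, we can define for the k-th dichotomizer a cost matrix similar to the
one shown in table 4 with cost ratio:

$$\rho^{(k)} = \frac{C_{FN}^{(k)}}{C_{FP}^{(k)}} = \frac{\sum_{s \in P^{(k)}} \sum_{r \in N^{(k)}} C_{sr} / d(M_s, M_r)}{\sum_{r \in N^{(k)}} \sum_{s \in P^{(k)}} C_{rs} / d(M_r, M_s)} \qquad (9)$$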



It is easy to see that the conditions for a realistic cost matrix (i.e.
$0 < \rho^{(k)} < +\infty$) are satisfied since $C_{ij} > 0 \ \forall i \neq j$ and
$d(M_r, M_s) \neq 0 \ \forall r \neq s$.
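
The derivation of the two-class costs from the multi-class cost matrix can be illustrated with a short sketch. The code below uses the coding matrix of table 1; the function name `dichotomizer_costs` and the cost matrix `COST` are hypothetical and chosen only for illustration. It assumes that every column of the coding matrix contains both labels, so that the denominator of the cost ratio is non-zero.

```python
# Coding matrix of table 1 (rows: classes A..E, columns: dichotomizers 1..7).
CODE = [
    [+1, -1, +1, -1, +1, +1, -1],  # A
    [-1, +1, +1, -1, +1, -1, -1],  # B
    [+1, -1, +1, +1, -1, +1, -1],  # C
    [-1, +1, +1, -1, +1, +1, -1],  # D
    [-1, -1, -1, +1, +1, -1, +1],  # E
]


def hamming(u, v):
    """Number of positions in which two codewords differ."""
    return sum(a != b for a, b in zip(u, v))


def dichotomizer_costs(code, cost, k):
    """Return (C_FP, C_FN, rho) for the k-th column (0-based), following eqs. (7)-(9)."""
    neg = [h for h, row in enumerate(code) if row[k] == -1]  # classes in N^(k)
    pos = [h for h, row in enumerate(code) if row[k] == +1]  # classes in P^(k)
    c_fp = sum(cost[r][s] / hamming(code[r], code[s]) for r in neg for s in pos)
    c_fn = sum(cost[s][r] / hamming(code[s], code[r]) for s in pos for r in neg)
    return c_fp, c_fn, c_fn / c_fp


# Hypothetical cost matrix: misclassifying a sample of class C or E costs 3,
# any other error costs 1, correct decisions cost 0.
COST = [[0 if i == j else (3 if i in (2, 4) else 1) for j in range(5)]
        for i in range(5)]

c_fp, c_fn, rho = dichotomizer_costs(CODE, COST, k=3)  # 4th dichotomizer
print(c_fp, c_fn, rho)
```

For the 4th dichotomizer the classes C and E form the positive set, so with these illustrative costs a false negative turns out to be three times as costly as a false positive.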
5 Experimental Results



To evaluate the effectiveness of the proposed approach we have made several
experiments on some data sets with different numbers of classes and dichotomizers with
different architectures; moreover, a comparison technique has been devised to ensure
that the outcomes obtained were statistically significant.

The data sets used are publicly available at the UCI Machine Learning
Repository [9]; all of them have numerical input features and a variable number of
classes. All the features were previously rescaled so as to have zero mean and unit
standard deviation. More details of the data sets are given in table 5. The table also
provides the type of coding matrix used for each data set. We chose an exhaustive
code [4] for those data sets that have a number of classes lower than 8 and a BCH
code [4] for those having a number of classes greater than or equal to 8. In particular,
for the Vowel and Letter Recognition data sets, we adopted, respectively, ECOC codes
15-11 and 63-26 available at http://web.engr.oregonstate.edu/~tgd.



Table 5. Data sets and coding matrices used in the experiments 



data set       classes  feat.  samples  train.   test   valid.  coding        code
                                          set     set     set   matrices      length
Ann-thyroid        3      21     7200    5040    1080    1080   Exhaustive       3
Dermatology        6      34      358     252      54      52   Exhaustive      31
Glass              6       9      214     149      32      33   Exhaustive      31
Sat Image          6      36     6435    4505     965     965   Exhaustive      31
Segmentation       7      18     2310    1617     350     343   Exhaustive      63
Optdigits         10      62     5620    3935     844     841   BCH 31-21       31
Pendigits         10      16    10992    7696    1647    1649   BCH 31-21       31
Vowel             11      10      990     693     154     143   Diett 15-11     15
Letter Rec.       26      16    20000   14001    2998    3001   Diett 63-26     63



The dichotomizers employed were Support Vector Machines implemented by
means of the SVMlight tool [10] available at http://svmlight.joachims.org.

In order to build dichotomizers tuned on the cost matrices determined according
to the method described in Section 4, we have adopted a post-learning scheme [11]
which evaluates a threshold t to be imposed on the output of the SVM, so as to
attribute the sample to be classified to the class N if the SVM output is less than t
and to the class P otherwise. The threshold is chosen so as to minimize the expected
classification cost on a validation set; this is an approach more general than the
standard SVM setting which uses a zero threshold.
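
A possible way of selecting the threshold on a validation set is sketched below. It is not the scheme of [11], whose details are not given here, but a simple exhaustive search under our own assumptions: the candidate thresholds are the midpoints between consecutive distinct outputs plus one value below and one above all of them, and the empirical cost FP + rho · FN is used, which is proportional to the expected cost of eq. (5) estimated on the validation set. The function name `pick_threshold` is hypothetical.

```python
def pick_threshold(scores, labels, rho):
    """Return the threshold t minimizing FP + rho * FN on a validation set.

    scores: real-valued dichotomizer outputs; labels: +1 (P) or -1 (N).
    A sample is assigned to P when its score is >= t, to N otherwise.
    """
    def cost(t):
        fp = sum(1 for s, y in zip(scores, labels) if y == -1 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == +1 and s < t)
        return fp + rho * fn

    distinct = sorted(set(scores))
    # Candidates: below all scores, between consecutive scores, above all scores.
    candidates = [distinct[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    candidates.append(distinct[-1] + 1.0)
    return min(candidates, key=cost)


val_scores = [-2.0, -0.5, 0.5, 1.0]
val_labels = [-1, +1, -1, +1]
print(pick_threshold(val_scores, val_labels, rho=5.0))  # costly false negatives
print(pick_threshold(val_scores, val_labels, rho=0.2))  # costly false positives
```

When false negatives are expensive the selected threshold moves below zero, so that more samples are attributed to P; when false positives are expensive it moves above zero.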

The aim of the experiments was to verify whether our method gives a real
improvement; we have therefore compared the classification costs obtained when the
described method is applied (i.e. when the dichotomizers are tuned according to the
cost matrices built as seen in Section 4) with the classification costs obtained by
using the standard ECOC architecture (i.e. the dichotomizers are employed without any
cost-sensitive tuning). With reference to the value assumed for thresholding the SVM
output, hereafter we will denote the first case as CST (Cost Sensitive Threshold) and
the second one as ZT (Zero Threshold).

To avoid any bias in the comparison, 12 runs of a multiple hold-out procedure
were performed on all data sets. In each run, the data set was split into three subsets: a
training set (containing 70% of the samples of each class), a validation set and a test
set (each containing 15% of the samples of each class); the final size of each of these
sets is given in table 5. The validation set was used to evaluate the optimal threshold
of the CST, while it was considered as part of the training set in the ZT method.

After the encoding phase, we obtained a different training set for each SVM. Since 
in such training sets the distribution between the two classes was frequently very 
skewed, the sets were balanced [12] before training so as to have a more effective 
learning for the SVMs. The majority class was randomly undersampled so as to 
equate the number of samples in the two classes. 
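
The balancing step can be sketched as a random undersampling of the majority class. The function name `undersample` and the fixed seed are our own illustrative choices; the actual procedure of [12] may differ in its details.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly discard majority-class samples until both classes have equal size.

    labels must contain only +1 and -1; the relative order of the samples is kept.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == +1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


xs, ys = undersample(list(range(13)), [+1] * 3 + [-1] * 10, seed=1)
print(ys.count(+1), ys.count(-1))
```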

The two classification costs to be compared were evaluated on the test set, thus
obtaining, for a given data set, 12 different values for each of the costs required. To
establish if the classification cost obtained by the CST was significantly better than
the cost of ZT, we used the Wilcoxon rank-sum test [13], which verifies whether the
costs of the CST tend to be higher than, lower than or indistinguishable from the
costs of ZT. All the results were provided with a significance level equal to 0.05.

To obtain a result unbiased with respect to the particular cost values, we applied
the approach proposed in [14]: one hundred different cost matrices have been used
whose elements were randomly generated according to a uniform distribution over
the range [1,10]. For each cost matrix, the test described above has been repeated.

Table 6 presents the results of the evaluation of the CST scheme compared with ZT;
the experiments were performed on SVMs with three different kernels: linear, 2nd
degree polynomial and RBF. Each row of the table corresponds to a data set, while
each group of three columns refers to a particular kernel. For each kernel, the three
columns contain the number of tests (out of 100, one per cost matrix) for which CST
has produced a classification cost respectively indistinguishable from, higher than
or lower than the classification cost obtained with ZT.



Table 6. Result of comparison between CST and ZT for different kernels 





(for each kernel: "=" indistinguishable, ">" higher, "<" lower cost of CST with respect to ZT)

                    Linear         2-degree Polynomial         RBF
data set         =     >     <        =     >     <        =     >     <
Ann-thyroid     28     5    67        0     0   100        0     0   100
Dermatology      8    92     0       74    23     3        0     0   100
Glass           62    38     0       36     9    55       16     1    83
Sat Image        0   100     0       22    77     1       15     3    82
Segmentation    42    58     0       57    43     0        6     0    94
Optdigits        1    99     0       19    81     0        6     0    94
Pendigits       51    44     5       44    56     0       30    18    52
Vowel           53    16    31       35    65     0       30    38    32
Letter Rec.     16     5    79       45    55     0       53    24    23



The results appear to be quite dependent on the kernel adopted for the SVM. In
fact, while for the linear kernel the percentage of cases in which CST performs
as well as or better than ZT is only 49.22%, this result grows to 54.56% for the 2nd
degree polynomial kernel and to 90.67% for the RBF kernel. Such dependence on the
particular dichotomizer used is a more general property of ECOC, as pointed out in [5].
The outcomes produced by the 2nd degree polynomial and by the RBF kernel can be
explained by taking into account the higher complexity of such dichotomizers, which
are able to discriminate better than the linear dichotomizer. However, there are two
situations, i.e. Vowel and Letter Recognition, in which this is not true. These cases
could be due to overfitting phenomena which affect the performance of the 2nd degree
polynomial and RBF kernels.
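
The percentages quoted above can be recomputed directly from table 6 by summing, for each kernel, the counts of the "indistinguishable" and "lower" columns over all data sets. The short sketch below does so; the names `TABLE6` and `equal_or_better` are ours.

```python
# Table 6: for each kernel (linear, 2nd degree polynomial, RBF), the counts of
# (indistinguishable, higher, lower) outcomes out of 100 cost matrices.
TABLE6 = {
    "Ann-thyroid":  [(28, 5, 67), (0, 0, 100), (0, 0, 100)],
    "Dermatology":  [(8, 92, 0), (74, 23, 3), (0, 0, 100)],
    "Glass":        [(62, 38, 0), (36, 9, 55), (16, 1, 83)],
    "Sat Image":    [(0, 100, 0), (22, 77, 1), (15, 3, 82)],
    "Segmentation": [(42, 58, 0), (57, 43, 0), (6, 0, 94)],
    "Optdigits":    [(1, 99, 0), (19, 81, 0), (6, 0, 94)],
    "Pendigits":    [(51, 44, 5), (44, 56, 0), (30, 18, 52)],
    "Vowel":        [(53, 16, 31), (35, 65, 0), (30, 38, 32)],
    "Letter Rec.":  [(16, 5, 79), (45, 55, 0), (53, 24, 23)],
}
KERNELS = ("linear", "2nd degree polynomial", "RBF")


def equal_or_better(table, kernel_index):
    """Percentage of tests in which the CST cost is not higher than the ZT cost."""
    rows = [row[kernel_index] for row in table.values()]
    good = sum(same + lower for same, _higher, lower in rows)
    total = sum(sum(row) for row in rows)
    return 100.0 * good / total


for index, kernel in enumerate(KERNELS):
    print(kernel, round(equal_or_better(TABLE6, index), 2))
```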

In summary, the experiments show that the proposed method can achieve an
improvement in terms of classification cost when using an ECOC-based classification
system in cost-sensitive applications. It remains to be defined which kind of
dichotomizer is best suited for a particular application, but this is a more general,
open problem that involves the estimation of the complexity of the data
characterizing the application [5].
Abstract. In this paper we introduce a method of creating structural
(i.e. graph-based) classifier ensembles through random node selection.
Different k-Nearest Neighbor classifiers, based on a graph distance
measure, are created automatically by randomly removing nodes in each
prototype graph, similar to random feature subset selection for creating
ensembles of statistical classifiers. These classifiers are then combined
using a Borda ranking scheme to form a multiple classifier system. We
examine the performance of this method when classifying a web
document collection; experimental results show the proposed method can
outperform a single classifier approach (using either a graph-based or
vector-based representation).



1 Introduction 

Classifiers are machine learning algorithms which attempt to assign a label (a
classification) to objects based on comparisons to previously seen training
examples. One application of such a system might be to automatically categorize
documents into specified categories to allow for later browsing or retrieval [1][2][3].
The performance of classifiers is measured by their ability to accurately assign
the correct class label to previously unseen data. In order to improve classifier
accuracy, the idea of creating classifier ensembles has been proposed [4][5]. This
methodology involves combining the output of several different (usually
independent) classifiers in order to build one large, and hopefully more accurate,
multiple classifier system. Several approaches have been proposed to create
classifier ensembles. Bagging, for instance, creates classifiers by randomly selecting
the group of training examples to be used for each classifier [6]. A similar idea is
that of random feature subset selection [7]. In this method, we randomly select
the features (dimensions) to be used for each feature vector to create a group of
classifiers.

Common to many classification algorithms, including those used in classifier
ensembles, is that they are designed so as to work on data which is represented
by vectors (i.e. sets of attribute values). However, using such vector
representations may lead to the loss of the inherent structural information in the
original data. Consider, for example, the case of classifying web documents based on
their content. Under the vector space model of document representation [8] each
term which may appear in a document is represented by a vector component
(or dimension). The value associated with each dimension indicates either the
frequency of the term or its relative importance according to some weighting
scheme. A problem with representing web documents in this manner is that
certain information, such as the order of term appearance, term proximity, term
location within the document, and any web specific information, is lost under
the vector model. Graphs are a more robust data structure which are capable of
capturing and maintaining this additional information.

Until recently, there have been no mathematical frameworks available for
dealing with graphs in the same fashion that we can deal with vectors in a
machine learning system. For example, the k-Nearest Neighbors (k-NN)
classification algorithm requires the computation of similarity (or distance) between
pairs of objects. This is easily accomplished with vectors in a Euclidean feature
space, but until recently it has not been possible with graphs [9][10][11]. Given
these new graph-theoretic foundations, a version of the k-Nearest Neighbors
algorithm which can classify objects which are represented by graphs rather than
by vectors has been proposed [12][13]. Experimental results comparing the
classification accuracy of the graph-theoretic method with traditional vector-based
k-NN classifiers showed that the performance when representing data by graphs
usually exceeds that of the corresponding vector-based approach [12][13].

In this paper, we introduce a technique for creating graph-based classifier 
ensembles using random node selection. To our knowledge, this is the first time 
such an approach has been taken to build structural classifier ensembles. In this 
paper we will also consider ensembles that include both structural, i.e. graph- 
based, and statistical, i.e. feature vector-based, classifiers. In [14] such an ap- 
proach, using one statistical and one structural classifier, has been proposed. 
However, the classifiers used in [14] were both designed by hand. By contrast, 
our method allows for automatic ensemble generation out of one single structural 
base classifier. We will perform experiments in order to test the accuracy of the 
classifier ensembles created with our novel procedure. The data set we use is a 
web document collection; the documents will be represented by both graphs and 
vectors and we will measure the accuracy of assigning the correct category to 
each document using the proposed method. 

The remainder of the paper is organized as follows. In Sect. 2 we will explain
how the web documents are represented by graphs. The details of the graph-based
k-Nearest Neighbors algorithm are given in Sect. 3. In Sect. 4 we describe
the method used to combine the k-NN classifiers into an ensemble. We present
the experimental results in Sect. 5. Finally, in Sect. 6, we give some concluding
remarks.



2 Graph Representation of Web Documents 

Content-based classification of web documents is useful because it allows users to
more easily navigate and browse collections of documents [1][3]. However, such
classifications are often costly to perform manually, as this requires a human
expert to examine the content of each web document and then make a
determination of its classification. Due to the large number of documents available
on the Internet in general, or even when we consider smaller collections of web
documents, such as those associated with corporate or university web sites, an
automated system which performs web document classification is desirable in
order to reduce costs and increase the speed with which new documents are
classified.

In order to represent web documents by graphs during classification, and thus 
maintain the information that is usually lost in a vector model representation, 
we will use the following method. First, each term (word) appearing in the web 
document, except for stop words such as “the”, “of”, and “and” which convey 
little information, becomes a node in the graph representing that document. This 
is accomplished by labeling each node with the term it represents. Note that we 
create only a single node for each unique word even if a word appears more than 
once in the text. Second, if word a immediately precedes word b somewhere in 
a “section” s of the web document, then there is a directed edge from the node 
corresponding to a to the node corresponding to b with an edge label s. We 
do not create an edge when certain punctuation marks (such as a period) are 
present between two words. Sections we have defined are: title, which contains 
the text related to the web document’s title and any provided keywords; link, 
which is text appearing in clickable hyperlinks on the web document; and text, 
which comprises any of the readable text in the web document (this includes 
link text but not title and keyword text). Next, we remove the most infrequently
occurring words for each document by deleting their corresponding nodes, leaving
at most m nodes per graph (m being a user provided parameter). This is similar
to the dimensionality reduction process for vector representations but with our
method the term set is usually different for each document. Finally, we perform
a simple stemming method and conflate terms to the most frequently occurring
form by re-labeling nodes and updating edges as needed. An example of this
type of graph representation is given in Fig. 1. The ovals indicate nodes and
their corresponding term labels. The edges are labeled according to title (TI),
link (L), or text (TX). The document represented by the example has the title
“YAHOO NEWS”, a link whose text reads “MORE NEWS”, and text containing
“REUTERS NEWS SERVICE REPORTS”. Note there is no restriction on the
form of the graph and that cycles are allowed. While this appears superficially
similar to the bigram, trigram, or N-gram methods [15], those are statistically- 
oriented approaches based on word occurrence probability models. The method 
described above does not require or use the computation of term probabilities. 
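
The core of the graph construction (one node per unique term, one labeled directed edge per adjacent pair of terms within a section) can be sketched as follows. This is a simplified illustration under our own assumptions: the function name `build_graph` is hypothetical, the input is already split into phrases at the punctuation marks that break edges, stop words are dropped before adjacency is evaluated, and the pruning to m nodes and the stemming step are omitted.

```python
STOP_WORDS = {"the", "of", "and"}  # only the examples named in the text; a real list is longer


def build_graph(sections):
    """Build (nodes, edges) from a mapping: section label -> list of phrases.

    Each phrase is a run of words with no edge-breaking punctuation inside it.
    Edges are (source term, target term, section label) triples.
    """
    nodes, edges = set(), set()
    for label, phrases in sections.items():
        for phrase in phrases:
            # Assumption: stop words are dropped before adjacency is evaluated.
            terms = [w.lower() for w in phrase.split() if w.lower() not in STOP_WORDS]
            nodes.update(terms)
            edges.update((a, b, label) for a, b in zip(terms, terms[1:]))
    return nodes, edges


# The example document described in the text.
nodes, edges = build_graph({
    "TI": ["YAHOO NEWS"],
    "L": ["MORE NEWS"],
    "TX": ["REUTERS NEWS SERVICE REPORTS"],
})
print(sorted(nodes))
print(sorted(edges))
```

Storing nodes and edges as sets reflects the fact that a term yields a single node however many times it occurs.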

Fig. 1. Example of a graph representation of a document

3 Graph-Theoretic k-Nearest Neighbors

In this section we describe the k-Nearest Neighbors (k-NN) classification
algorithm and how we can easily extend it to work with graph-based data. The basic
k-NN algorithm is given as follows [16]. First, we have a set of training examples
(or prototypes). In the traditional k-NN approach these are numerical vectors in
a Euclidean feature space. Each of these prototypes is associated with a label
which indicates to what class the prototype belongs. Given a new, previously
unseen instance, called an input instance, we attempt to estimate which class
it belongs to. Under the k-NN method this is accomplished by looking at the k
training instances closest (i.e. with least distance) to the input instance. Here k
is a user provided parameter and distance is usually defined to be the Euclidean
distance. However, in information retrieval applications, the cosine similarity or
Jaccard similarity measures [8] are often used due to their length invariance
properties.

Once we have found the k training instances nearest to the input instance 
using some distance measure, we estimate the class of the input instance by the 
majority held among the k training instances. This class is then assigned as the 
predicted class for the input instance. If there are ties due to more than one 
class having equal numbers of representatives amongst the nearest neighbors 
we can either choose one class randomly or we can break the tie with some 
other method, such as selecting the tied class which has the minimum distance 
neighbor. For the experiments in this paper we will use the latter method, which 
in our experiments has shown a slight improvement over random tie breaking. 
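
The voting rule with the tie-breaking strategy described above can be sketched as follows. The function name `knn_classify` is ours, and the distance is passed in as a parameter so that the same code applies to vectors and to graphs.

```python
from collections import Counter


def knn_classify(query, prototypes, distance, k):
    """Classify `query` by majority vote among its k nearest prototypes.

    prototypes: list of (instance, label) pairs.
    Ties are broken in favour of the tied class owning the nearest neighbor.
    """
    ranked = sorted(((distance(query, x), label) for x, label in prototypes),
                    key=lambda pair: pair[0])
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    best = max(votes.values())
    tied = {label for label, count in votes.items() if count == best}
    # neighbors are sorted by distance, so the first tied label is the nearest one.
    return next(label for _, label in neighbors if label in tied)


protos = [(0.0, "a"), (1.0, "b"), (5.0, "a"), (6.0, "b")]
dist = lambda u, v: abs(u - v)
print(knn_classify(0.9, protos, dist, k=2))  # one vote each: nearest neighbor decides
print(knn_classify(0.9, protos, dist, k=3))  # plain majority
```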

In order to extend the k-NN method to work with graphs instead of
vectors, we only need a distance measure which computes the distance between
two graphs instead of two vectors, since both training and input instances will
be represented by graphs. One such graph distance measure is based on the
maximum common subgraph [9]:

$$dist(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{\max(|G_1|, |G_2|)} \qquad (1)$$

Here $G_1$ and $G_2$ are graphs, $mcs(G_1, G_2)$ is their maximum common subgraph,
$\max(\cdot)$ is the standard numerical maximum operation, and $|\cdot|$ denotes the
size of the graph, which is taken in this context to be the number of nodes and
edges in the graph. This distance measure has been shown to be a metric [9].
More information about graph-theoretic k-NN classifiers and applications to web
document classification can be found in [12][13].
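
For the document graphs of Sect. 2 the distance of eq. (1) is easy to evaluate. The sketch below relies on an assumption of ours that is not stated in this section: since every term labels at most one node, the maximum common subgraph is taken to consist of the nodes and the labeled edges that the two graphs share. For general graphs without unique node labels this shortcut does not apply. The function names are hypothetical.

```python
def graph_size(graph):
    """Size of a graph: number of nodes plus number of edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance of eq. (1) for graphs given as (node set, edge set) pairs.

    Assumes unique node labels, so the maximum common subgraph is the
    intersection of the node sets and of the (source, target, label) edge sets.
    """
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are identical
    common = len(g1[0] & g2[0]) + len(g1[1] & g2[1])
    return 1.0 - common / largest


g1 = ({"a", "b", "c"}, {("a", "b", "TX"), ("b", "c", "TX")})
g2 = ({"a", "b", "d"}, {("a", "b", "TX")})
print(mcs_distance(g1, g2))
```

Here the common subgraph has two nodes and one edge, while the larger graph has size 5.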
4 Creating Ensembles of Structural Classifiers

In this section we describe our proposed method of creating graph-based
classifiers and combining them into an ensemble. As mentioned earlier, the goal of
creating an ensemble of classifiers is to achieve a higher overall accuracy than
any single classifier in the ensemble by combining the output of the individual
classifiers [4][5]. The classifiers used in our ensembles perform the k-Nearest
Neighbors method described in Sect. 3. Different graph-based classifiers are
generated by randomly removing nodes (and their incident edges) from the training
graphs until a maximum number of nodes is reached for all graphs. We create
several graph-based classifiers using this method, and each becomes a classifier
in the ensemble.
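
The random node selection step can be sketched as follows, with graphs stored as a pair (node set, edge set) and edges as (source, target, label) triples. The function name `random_node_subgraph` and the way the random generator is passed in are our own illustrative choices.

```python
import random


def random_node_subgraph(graph, max_nodes, rng):
    """Randomly keep at most `max_nodes` nodes, dropping all incident edges of the rest."""
    nodes, edges = graph
    if len(nodes) <= max_nodes:
        kept = set(nodes)
    else:
        # sorted() gives rng.sample a sequence with a reproducible order.
        kept = set(rng.sample(sorted(nodes), max_nodes))
    kept_edges = {edge for edge in edges if edge[0] in kept and edge[1] in kept}
    return kept, kept_edges


g = ({"a", "b", "c", "d", "e", "f"},
     {("a", "b", "TX"), ("b", "c", "TX"), ("c", "d", "L"), ("e", "f", "TI")})
sub_nodes, sub_edges = random_node_subgraph(g, 3, random.Random(7))
print(sorted(sub_nodes), sorted(sub_edges))
```

Applying the function to every training graph with a different random generator state yields the prototypes of one graph-based classifier of the ensemble.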

For each classifier, we output the three top ranked classification labels. The 
ranked outputs from each classifier are then combined using a Borda count [17]. 
The first ranked class receives a vote of 3, the second a vote of 2, and the third a 
vote of 1. Using the Borda count we select the class with the highest total vote 
count as the predicted class for the ensemble, with ties broken arbitrarily. 
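
The combination rule can be sketched in a few lines. The function name `borda` and the example labels are hypothetical; ties are broken arbitrarily, here in favour of the label that was encountered first.

```python
from collections import defaultdict


def borda(rankings, points=(3, 2, 1)):
    """Combine ranked outputs: each classifier gives 3, 2 and 1 votes to its top three labels.

    rankings: one list of labels per classifier, best first.
    Returns the winning label and the total votes of every label.
    """
    totals = defaultdict(int)
    for ranking in rankings:
        for label, vote in zip(ranking, points):
            totals[label] += vote
    winner = max(totals, key=totals.get)  # ties: first label encountered wins
    return winner, dict(totals)


winner, totals = borda([
    ["sports", "news", "tech"],
    ["news", "sports", "tech"],
    ["news", "tech", "sports"],
])
print(winner, totals)
```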

5 Experiments and Resnlts 

In order to evaluate the performance of our proposed method of creating
classifier ensembles using random node selection, we performed several
experiments on a collection of web documents. Our data set contains 185
documents, each belonging to one of ten categories, and was obtained from
ftp.cs.umn.edu/dept/users/boley/PDDPdata/. For our experiments, we created
graphs from the original web documents as described in Sect. 2. For the
parameter m, which indicates the maximum number of the most frequent nodes to
retain in each graph, we used a value of 100. In addition to the graph-based
classifiers, we also include a single vector-based k-NN classifier in the
ensemble. For the vector-based classifier, we used a standard term-document
matrix of 332 dimensions that was provided with the data set, together with a
distance based on the Jaccard similarity [8].
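
The exact distance used in [8] is not restated here. One common distance derived from the Jaccard similarity is one minus the ratio of shared terms to all terms; the sketch below assumes that form and assumes documents are reduced to sets of terms, both of which are assumptions for illustration only.

```python
def jaccard_distance(terms_a, terms_b):
    """Return 1 - |A intersect B| / |A union B| for two collections of terms.

    Two empty documents are treated as identical (distance 0.0), which is
    a convention chosen for this sketch.
    """
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```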

The experimental results are given in Table 1. Each row indicates an
experiment with different parameter values, which are shown in the three
leftmost columns. The parameters are NN, the maximum number of nodes to
randomly be included in each prototype graph; NC, the number of classifiers in
the ensemble; and k, the parameter used in the k-NN algorithm (the number of
nearest neighbors). Note that in a given ensemble there will be NC - 1
graph-based classifiers created through random node selection and a single
vector-based classifier.

The results of each experiment are shown in the next six columns as
classification accuracy measured by leave-one-out. Min is the accuracy of the
worst single classifier in the ensemble. Similarly, Max is the accuracy of the
best single classifier in the ensemble. Ens is the accuracy of the combined
ensemble using Borda count as described above. Oracle is the accuracy if we
assume that when at least one individual classifier in the ensemble correctly
classifies a document the ensemble will also correctly classify the document;
this gives us an upper bound on the best possible accuracy of the ensemble if
we were able to find an optimal combination method but leave the classifiers
themselves as they are. BL (G) (baseline graph-based classifier) gives the
accuracy of a single graph-based classifier using the standard k-NN method with
the full sized training graphs (m = 100) for a baseline comparison; similarly,
BL (V) (baseline vector-based classifier) gives the accuracy of the vector-based
k-NN classifier used in the ensemble. The final column, Imp, is the difference
between Ens and BL (G), which indicates the performance improvement realized by
the ensemble over the baseline graph-based classifier. The average of each
column is shown in the bottom row of the table.
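
To make the Min, Max, Ens, and Oracle columns concrete, the following sketch computes them from per-document predictions. The function name `ensemble_metrics` and the toy labels are hypothetical; the point is only that Oracle counts a document as correct whenever any member classifier is correct.

```python
def ensemble_metrics(true_labels, member_preds, ensemble_preds):
    """Compute Min, Max, Ens and Oracle accuracies as fractions in [0, 1].

    member_preds: one prediction list per member classifier.
    ensemble_preds: the combined (e.g. Borda count) predictions.
    """
    n = len(true_labels)
    member_acc = [
        sum(p == t for p, t in zip(preds, true_labels)) / n
        for preds in member_preds
    ]
    ens = sum(p == t for p, t in zip(ensemble_preds, true_labels)) / n
    # Oracle: a document counts as correct if at least one member got it right.
    oracle = sum(
        any(preds[i] == true_labels[i] for preds in member_preds)
        for i in range(n)
    ) / n
    return {"Min": min(member_acc), "Max": max(member_acc),
            "Ens": ens, "Oracle": oracle}


truth = ["a", "b", "c", "d"]
members = [["a", "b", "x", "x"], ["a", "x", "c", "x"], ["x", "b", "c", "x"]]
combined = ["a", "b", "c", "x"]
metrics = ensemble_metrics(truth, members, combined)
```

In this toy case every member is 50% accurate while the combined output and the oracle both reach 75%, which mirrors how an ensemble can exceed its best member without exceeding the oracle bound.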



Table 1. Experimental results

NN       NC  k  Min     Max     Ens     Oracle  BL (G)  BL (V)  Imp
50       3   1  68.65%  73.51%  83.24%  91.35%  80.00%  73.51%   3.24%
50       3   3  70.27%  74.59%  82.16%  92.43%  81.62%  74.59%   0.54%
50       3   5  73.51%  74.05%  81.08%  92.43%  83.24%  74.05%  -2.16%
50       5   1  63.78%  74.05%  83.78%  93.51%  80.00%  73.51%   3.78%
50       5   3  72.43%  78.38%  82.16%  94.59%  81.62%  74.59%   0.54%
50       5   5  72.43%  78.92%  81.62%  95.14%  83.24%  74.05%  -1.62%
50       10  1  63.78%  73.51%  80.00%  94.05%  80.00%  73.51%   0.00%
50       10  3  68.11%  79.46%  81.08%  95.68%  81.62%  74.59%  -0.54%
50       10  5  70.81%  80.54%  81.62%  95.14%  83.24%  74.05%  -1.62%
75       3   1  72.97%  76.76%  84.32%  91.35%  80.00%  73.51%   4.32%
75       3   3  74.59%  78.38%  81.08%  91.35%  81.62%  74.59%  -0.54%
75       3   5  74.05%  78.92%  83.24%  92.43%  83.24%  74.05%   0.00%
75       5   1  70.81%  74.05%  82.16%  91.35%  80.00%  73.51%   2.16%
75       5   3  72.97%  78.92%  81.08%  94.05%  81.62%  74.59%  -0.54%
75       5   5  74.05%  81.08%  85.41%  94.59%  83.24%  74.05%   2.17%
75       10  1  68.65%  76.76%  80.00%  90.81%  80.00%  73.51%   0.00%
75       10  3  72.43%  78.38%  82.70%  95.14%  81.62%  74.59%   1.08%
75       10  5  74.05%  81.08%  83.24%  95.14%  83.24%  74.05%   0.00%
Average         71.02%  77.30%  82.22%  93.36%  81.62%  74.05%   0.60%



The first thing we observe from the results is that, in every case, the
accuracy of the ensemble (Ens) was greater than that of the best single
classifier in the ensemble (Max). Additionally, the ensemble accuracy (Ens) was
always greater than that of the baseline vector classifier (BL (V)). We note
also that the best accuracy attained by our ensemble method was 85.41% (for
NN = 75, NC = 5, k = 5), while the best accuracy achieved by the graph-based
baseline classifiers (BL (G)) was 83.24% (for k = 5); out of BL (G), BL (V), and
Ens, Ens attained both the highest average and highest maximum accuracy. Note
that the performance of BL (G) and BL (V) is dependent only on the parameter k.
The results furthermore show that the ensemble was an improvement over the
baseline graph-based classifier in 8 out of 18 cases; conversely, BL (G) was
better than Ens in 6 out of 18 cases. However, Oracle, in all cases, was much
better than Ens (the maximum oracle accuracy was 95.68%). This suggests one
should look at improving the combination scheme. This could be done, for
example, by altering our current method of Borda ranking to include more
rankings or better tie breaking procedures. Alternatively we could introduce a
weighted voting scheme, where classifier weights are determined by methods such
as genetic algorithms [18]. One could also take the distances returned by our
k-NN classifiers into account. The number of classifiers (NC) and the parameter
k did not seem to affect ensemble accuracy in any specific way.



6 Conclusions 

In this paper we introduced the novel concept of creating structural classifier 
ensembles through random node selection. This is a method similar to random 
subset feature selection, but applied to (structural) classifiers that deal with data 
represented by graphs rather than vectors. The accuracy of the ensembles created 
using this method was slightly better, on average, than the accuracy of a baseline 
single classifier when classifying a collection of web documents; the best accuracy 
achieved by our method was also greater than the best accuracy of any baseline 
classifier. In addition, the overall ensemble accuracy was an improvement over 
the best individual classifier in the ensemble in all cases, and the oracle accuracy 
(which measures accuracy achieved when using an optimal combination strategy) 
was an improvement over the ensemble and baseline accuracies in all cases. 

The work described in this paper is, to the knowledge of the authors, the first 
on classifier ensembles in the domain of structural pattern recognition. Our fu- 
ture work will be directed toward examining the effect of other parameters which 
were not considered in the experiments presented here, such as the maximum 
number of nodes after dimensionality reduction (m) and the number of nodes 
used for random node selection. We also plan to vary the number of classifiers
in the ensemble over a wider range of values. As we saw in our experiments,
examining the accuracy of the ensemble as if it were an oracle showed a signifi- 
cant potential improvement in performance. This is a strong motivation to look 
at further refining the classifier combination method. 

There are many other open issues to explore as well. For example, instead of 
random node selection, we could select nodes based on some criteria, such as their 
degree (i.e., the number of incident edges) or a particular structural pattern they 
form with other nodes in the graph (chains, cliques, etc.). We have previously 
experimented with various other graph representations of documents [19], such 
as those that capture frequency information about the nodes and edges; using 
these representations in the context of ensemble classifiers will be a subject of 
future experiments. Ensemble performance could perhaps be further improved 
by analyzing the documents on which the baseline graph classifier outperformed 
the ensemble. We can also create different graph-based classifiers for an ensemble 
by changing the graph-theoretic distance measures used or through more well- 
known techniques such as bagging. 

Abstract. We experimentally evaluated bagging and six other
randomization-based ensemble tree methods. Bagging uses randomization to
create multiple training sets. Other approaches, such as Randomized 
C4.5, apply randomization in selecting a test at a given node of a tree. 
Then there are approaches, such as random forests and random sub- 
spaces, that apply randomization in the selection of attributes to be 
used in building the tree. On the other hand boosting incrementally 
builds classifiers by focusing on examples misclassified by existing clas- 
sifiers. Experiments were performed on 34 publicly available data sets. 
While each of the other six approaches has some strengths, we find that 
none of them is consistently more accurate than standard bagging when 
tested for statistical significance. 



1 Introduction 

Bagging [1], Adaboost.M1W [2-4], three variations of random forests [5], one
variation of Randomized C4.5 [6] (which we will call by the more general name 
“random trees”), and random subspaces [7] are compared. With the exception 
of boosting, each ensemble creation approach compared here can be distributed 
in a simple way across a set of processors. This makes them suitable for learning 
from very large data sets [8-10] because each classifier in an ensemble can be 
built at the same time if processors are available. The classification accuracy 
of the approaches was evaluated through a series of 10-fold cross validation 
experiments on 34 data sets taken mostly from the UC Irvine repository [11]. 
We used the open source software package “OpenDT” [12] for learning decision 
trees in parallel, which outputs trees very similar to C4.5 release 8 [13], and has 
added functionality for ensemble creation. 

Our previous results [14] showed that each of the ensemble creation
techniques gives a statistically significant, though small, increase in
accuracy over a single decision tree. However, in head-to-head comparisons with
bagging, none of the ensemble building methods was generally statistically
significantly more accurate.

F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 223-232, 2004.
© Springer-Verlag Berlin Heidelberg 2004

Robert E. Banfield et al.

This work extends our previous work [14] by comparing against the results 
for boosting, using ANOVA to better understand the statistical significance of 
the results, and increasing the number and size of the data sets used. We have 
also increased the number of classifiers in each ensemble, with the exception of 
boosting, to 1000. By using a greater number of decision trees in the ensemble 
and larger size data sets, our conclusions about several of the methods have 
changed. 

2 Ensemble Creation Techniques Evaluated 

Ho’s random subspace method of creating a decision forest utilizes the random 
selection of attributes or features in creating each decision tree. Ho used a ran- 
domly chosen 50% of the attributes to create each decision tree in an ensemble 
and the ensemble size was 100 trees. 
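
A minimal sketch of the attribute sampling behind the random subspace method is given below. It assumes that each tree receives a subset drawn without replacement and that an odd attribute count is rounded down; neither detail is stated above, and the function name `random_subspaces` is hypothetical.

```python
import random


def random_subspaces(attributes, n_trees, fraction=0.5, seed=0):
    """Draw one random attribute subset per tree.

    fraction=0.5 matches the 50% setting described for Ho's method;
    rounding down (with a minimum of one attribute) is an assumption.
    """
    rng = random.Random(seed)
    k = max(1, int(len(attributes) * fraction))
    return [sorted(rng.sample(attributes, k)) for _ in range(n_trees)]


subsets = random_subspaces(["a1", "a2", "a3", "a4", "a5", "a6"], n_trees=100)
```

Each subset would then be used to train one decision tree on the full set of examples restricted to those attributes.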

The random subspace approach was better than bagging and boosting for
a single train/test data split for four data sets taken from the StatLog project
[15]. Ho tested 14 other data sets by splitting them into two halves randomly.
Each half was used as a training set with the other half used as a test set.
This was done 10 times for each of the data sets. The maximum and minimum
accuracy results were deleted and the other eight runs were averaged. There
was no evaluation of statistical significance. The conclusion was that random
subspaces was better for data sets with a large number of attributes. It was not
as good with a smaller number of attributes and a small number of examples,
or a small number of attributes and a large number of classes. This approach is
interesting for large data sets with many attributes because it requires less
time and memory to build each of the classifiers.

Breiman’s random forest approach to creating an ensemble also utilizes a 
random choice of attributes in the construction of each CART decision tree [16, 
5]. However, a random selection of a subset of attributes occurs at each node 
in the decision tree. Potential tests from these random attributes are evaluated 
and the best one is chosen. So, it is possible for each of the attributes to be 
utilized in the tree. The number of random attributes chosen for evaluation at 
each node is a variable in this approach. Additionally, bagging is used to create 
the training set for each of the trees. We utilized random subsets of size 1, 2 and
$\lfloor \log_2(n) + 1 \rfloor$, where n is the number of attributes.
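
The per-node attribute selection can be sketched as below. The names `rf_subset_size` and `choose_split_attribute`, the variant labels, and the use of a plain gain function are assumptions for illustration; a real implementation would evaluate candidate tests on the examples reaching the node.

```python
import math
import random


def rf_subset_size(n_attributes, variant):
    """Number of attributes examined at each node for the three variants."""
    if variant == "1":
        return 1
    if variant == "2":
        return min(2, n_attributes)
    if variant == "lg":
        return min(n_attributes, math.floor(math.log2(n_attributes) + 1))
    raise ValueError("unknown variant: " + variant)


def choose_split_attribute(attributes, gain, variant, rng):
    """Sample a random attribute subset and return the one with the best gain.

    gain is a callable mapping an attribute to its score at this node.
    """
    k = rf_subset_size(len(attributes), variant)
    candidates = rng.sample(attributes, k)
    return max(candidates, key=gain)
```

For a data set with 36 attributes, for instance, the `lg` variant examines 6 attributes at each node, and because a fresh subset is drawn at every node, any attribute can still end up in the tree.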

Random forest experiments were conducted on 20 data sets and compared 
with Adaboost on the same data sets in [5]. Ensembles of 100 decision trees 
were built for the random forests and 50 decision trees for Adaboost. For the 
zip-code data set 200 trees were used. A random 10% of the data was left out of 
the training set to serve as test data. This was done 100 times and the results 
averaged. The random forest with a single attribute randomly chosen at each 
node was better than Adaboost on 11 of the 20 data sets. There was no evaluation 
of statistical significance. It was significantly faster to build the ensembles using
random forests. 

Dietterich introduced an approach which he called Randomized C4.5 [6], 
which comes under our more general description of random trees. In this ap- 
proach, at each node in the decision tree the 20 best tests are determined and 
the actual test used is randomly chosen from among them. With continuous at- 
tributes, it is possible that multiple tests from the same attribute will be in the 
top 20. 

Dietterich experimented with 33 data sets from the UC Irvine repository. For 
all but three of them a 10-fold cross validation approach was followed. The best 
result from a pruned or unpruned ensemble was reported. Pruning was done with 
a certainty factor of 10. The test results were evaluated for statistical significance 
at the 95% confidence level. It was found that Randomized C4.5 was better than 
C4.5 14 times and equivalent 19 times. It was better than bagging with C4.5 6 
times, worse 3 times and equivalent 24 times. From this, it was concluded that 
the approach tends to produce an equivalent or better ensemble than bagging. It 
has the advantage that you do not have to create multiple instances of a training 
set. 

3 Algorithm Modifications 

We describe our implementation of random forests and a modification to Diet- 
terich’s randomized C4.5 method. In OpenDT, like C4.5, a penalty is assessed 
to the information gain of a continuous attribute with many potential splits. In 
the event that the attribute set randomly chosen provides a “negative” infor- 
mation gain, our approach is to randomly re-choose attributes until a positive 
information gain is obtained, or no further split is possible. This enables each 
test to improve the purity of the resultant leaves. This approach was also used 
in WEKA [17]. 

We have made a modification to the Randomized C4.5¹ ensemble creation
method in which only the best test from each attribute is allowed to be among
the best set of twenty tests from which one is randomly chosen. This allows
a greater chance of discrete attributes being chosen for testing when there are a
large number of continuous valued attributes. We call it random trees B.
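
The difference between the two selection rules can be sketched as follows. The tuple layout `(attribute, split, gain)` and the names `candidate_pool` and `pick_test` are assumptions made for the example and are not taken from OpenDT.

```python
import random


def candidate_pool(tests, one_per_attribute, top_k=20):
    """Return the top_k tests by gain.

    tests: list of (attribute, split, gain) tuples.
    one_per_attribute=False mimics choosing among the 20 best tests overall;
    one_per_attribute=True keeps only the best test of each attribute first,
    as in the random trees B modification.
    """
    if one_per_attribute:
        best = {}
        for attr, split, gain in tests:
            if attr not in best or gain > best[attr][2]:
                best[attr] = (attr, split, gain)
        tests = list(best.values())
    return sorted(tests, key=lambda t: t[2], reverse=True)[:top_k]


def pick_test(tests, rng, one_per_attribute):
    """Pick the test for a node uniformly at random from the candidate pool."""
    return rng.choice(candidate_pool(tests, one_per_attribute))


# One continuous attribute with 25 strong thresholds and one weak discrete one.
tests = [("age", 20 + i, 0.50 - 0.001 * i) for i in range(25)]
tests.append(("color", None, 0.10))
plain_pool = candidate_pool(tests, one_per_attribute=False)
b_pool = candidate_pool(tests, one_per_attribute=True)
```

In this toy case the unrestricted pool holds 20 thresholds of the same continuous attribute, whereas the restricted pool contains one test per attribute, so the discrete attribute can actually be drawn.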

4 Experimental Results 

We used 34 data sets, 32 from the UC Irvine repository [11], credit-g from
NIAAD (www.liacc.up.pt/ML) and phoneme from the ELENA project. The data
sets, described in Table 1, have from 4 to 69 attributes and the attributes are
a mixture of continuous and nominal values. The ensemble size was 50 for the
boosting approach, and 1000 trees for each of the other approaches. While 1000
trees are more than the original authors suggested, this number was chosen so
that more than enough trees were present in the ensemble. Unlike boosting,
Breiman has argued these other techniques do not overfit as more classifiers are
added to the ensemble [5].

¹ On a code implementation note, we allow trees to be grown to single example leaves,
which we call pure trees. MINOBJS is set to one (which means a test will be
attempted any time there are two or more examples at a node), tree collapsing is not
allowed and dynamic changes in the minimum number of examples in a branch for
a test to be used are not allowed.

Table 1. Description of data sets attributes and size.

Data Set     # attributes  # continuous attributes  # examples  # classes
anneal       38            6                        898         6
audiology    69            0                        226         24
autos        25            15                       205         7
breast-w     9             9                        699         2
breast-y     9             0                        286         2
credit-a     15            6                        690         2
credit-g     20            7                        1000        2
glass        9             9                        214         7
heart-c      13            5                        303         2
heart-h      13            5                        294         2
heart-s      13            5                        123         2
heart-v      13            5                        200         2
hepatitis    19            6                        155         2
horse-colic  22            8                        368         2
hypo         25            7                        3163        2
ion          34            34                       351         2
iris         4             4                        150         3
krkp         36            0                        3196        2
labor        16            8                        57          2
led-24       24            0                        5000        10
letter       16            16                       20000       26
lymph        18            3                        148         4
page         10            10                       5473        5
pendigits    16            16                       10992       10
phoneme      5             5                        5404        2
pima         8             8                        768         2
primary      17            0                        339         22
satimage     36            36                       6435        7
sick         29            7                        3772        2
sonar        60            60                       208         2
soybean      35            0                        683         19
vehicle      18            18                       846         4
voting       15            0                        435         2
waveform     21            21                       5000        3

For the random trees B approach, we used a random test from the 20 at- 
tributes with maximal information gain. In the random subspace approach of 
Ho, half ([n/2]) of the attributes were chosen each time. For the random
forest approach, we tested using a single attribute, 2 attributes and
$\lfloor \log_2 n + 1 \rfloor$ attributes (which will be abbreviated as
Random Forests-lg in the following).

For each data set, a 10-fold cross validation was done. For each fold, an
ensemble is built by each method and tested on the held out data. This allows for
statistical comparisons between approaches to be made. Each ensemble consists 
solely of unpruned trees. ANOVA was first used to determine which data sets 
showed statistically significant differences at a specified confidence level. Sub- 
sequently, a paired t-test was used to determine, at the same confidence level, 
whether a particular ensemble approach is better or worse than bagging. 
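
The paired comparison against bagging can be sketched with the standard library alone. This sketch covers only the paired t-test step, not the preceding ANOVA, and it assumes exactly ten per-fold accuracies so that the two-sided 99% critical value for 9 degrees of freedom (about 3.250) applies. The function names are hypothetical.

```python
import math
import statistics

T_CRIT_99_DF9 = 3.250  # two-sided 99% critical value of Student's t, 9 d.o.f.


def paired_t(method_acc, bagging_acc):
    """Paired t statistic over per-fold accuracy differences."""
    diffs = [m - b for m, b in zip(method_acc, bagging_acc)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        # All folds differ by the same amount; treat a zero mean as no difference.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))


def compare_to_bagging(method_acc, bagging_acc):
    """Classify a method as 'better', 'worse' or 'same' relative to bagging."""
    if len(method_acc) != 10 or len(bagging_acc) != 10:
        raise ValueError("this sketch hard-codes the 10-fold critical value")
    t = paired_t(method_acc, bagging_acc)
    if t > T_CRIT_99_DF9:
        return "better"
    if t < -T_CRIT_99_DF9:
        return "worse"
    return "same"
```

A library routine for the t distribution would normally replace the hard-coded critical value; it is fixed here only to keep the example self-contained.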

Table 2 shows the comparative results at the 99% confidence level. All
ensemble creation techniques to the right of the bagging column in the table
utilized bagging in creating training sets. For 30 of the 34 data sets none
of the ensemble approaches could produce a statistically significant improvement
over bagging. On two data sets, all techniques showed accuracy improvement (as
indicated by a boldface number). The best ensemble building approaches by a
slight margin were random forests-lg and random forests-2, which are better
than bagging four times. Both of these methods are worse than bagging only
twice (indicated by a number in italics). Boosting had the fewest wins at two,
while losing to bagging once. Random subspaces lost to bagging four times and
registered only three wins. Random trees B and random forests-1 were better 3
times and worse 1 time and 2 times respectively.

We can create a summary score for each ensemble algorithm by providing 
1 point for a win, and 1/2 point for a tie. At the 99% confidence level the 
top performing ensemble methods are random forests-lg, random forests-2 and 
Random Trees B (18 points). All other approaches score 17.5 points or 16.5 
(random subspaces). 
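
The summary score can be reproduced from the Better/Worse/Same counts reported in Table 2. The sketch below simply applies the rule of 1 point per win and 1/2 point per tie; the dictionary layout and the function name `summary_score` are choices made for the example.

```python
def summary_score(wins, ties):
    """One point for each win over bagging, half a point for each tie."""
    return wins + 0.5 * ties


# (better, worse, same) versus bagging, as listed in Table 2.
records = {
    "Boosting": (2, 1, 31),
    "Random Subspaces": (3, 4, 27),
    "Random Trees B": (3, 1, 30),
    "Random Forests-lg": (4, 2, 28),
    "Random Forests-1": (3, 2, 29),
    "Random Forests-2": (4, 2, 28),
}
scores = {name: summary_score(b, s) for name, (b, w, s) in records.items()}
```

A method that tied bagging on all 34 data sets would score 17 under this rule, which is why 17 serves as the reference point when reading the scores.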

An interesting question is how these approaches would rank if the average
accuracy, regardless of significance, was the only criterion. Once again
random forests-lg and random forests-2 appear the best (25.5 and 24 points
respectively). Random subspaces (22.5 points) performs much better in this study,
beating random forests-1 (21 points), boosting (18 points), and random trees B 
(17.5 points). It is worth noting that all scores are above 17 which means they 
are each better than growing a bagged ensemble on average. Clearly, utilizing 
statistical significance tests changes the conclusions that one would make given 
these experimental results. 

To complete our investigation of the performance of these ensemble creation 
techniques we list the average accuracy results over ten folds and provide a 
Borda count. The Borda count [18] is calculated by assigning a rank to each of 
the proposed methods (first place, second place, etc.). The first place method 
obtains N points, second place takes N - 1 points, and so on, where N is the
number of methods compared. The sum of those values across all data sets is 
the Borda count, as shown in Table 2, where greater values are better. 
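
A sketch of this ranking-based Borda count follows. The handling of ties is not specified above; the sketch assumes tied methods share the average of the points for the positions they occupy, and this assumption may differ from what was actually done. The example uses only three of the seven methods on two data sets, with accuracies taken from Table 2, so its totals are not comparable to the Borda row of that table.

```python
def borda_points(accuracies):
    """Points for one data set: N for first place, N - 1 for second, and so on.

    accuracies: dict mapping method name to accuracy.
    Tied methods share the average of their positions' points (an assumption).
    """
    n = len(accuracies)
    ordered = sorted(accuracies, key=accuracies.get, reverse=True)
    points = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and accuracies[ordered[j + 1]] == accuracies[ordered[i]]:
            j += 1
        shared = sum(n - p for p in range(i, j + 1)) / (j - i + 1)
        for method in ordered[i:j + 1]:
            points[method] = shared
        i = j + 1
    return points


def borda_count(results):
    """Sum the per-data-set points of every method."""
    totals = {}
    for accuracies in results.values():
        for method, pts in borda_points(accuracies).items():
            totals[method] = totals.get(method, 0) + pts
    return totals


results = {
    "anneal": {"Boosting": 99.33, "Bagging": 99.22, "Random Forests-2": 99.78},
    "letter": {"Boosting": 96.74, "Bagging": 94.90, "Random Forests-2": 96.81},
}
totals = borda_count(results)
```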

Again we see random forests-lg and random forests-2 taking the lead, with 
Borda counts of 167 and 166, respectively. Random trees B and random sub- 
spaces obtain the next best scores of 152 and 150. It is difficult to say how many
points constitute a “significant” win; however, boosting (128), bagging (118),
and random forests-1 (117) certainly have a non-trivial number of points fewer.

Table 2. The average raw accuracy results. Boldface indicates statistically significantly
better than bagging at the 99% confidence level and italics means significantly worse.
Results of a Borda count are provided with higher values signifying better performance.
Summary scores are also reported.

Data set     Boosting  Random     Random   Bagging  Random      Random     Random
                       Subspaces  Trees B           Forests-lg  Forests-1  Forests-2
anneal       99.33     99.78      99.78    99.22    99.33       99.67      99.78
audiology    79.57     81.74      78.26    80.43    81.74       76.09      78.26
autos        87.76     89.74      86.79    86.76    85.86       81.38      84.83
breast-w     96.85     96.99      96.28    95.99    96.85       97.14      96.99
breast-y     68.60     73.17      71.05    74.53    73.14       74.23      74.18
credit-a     86.23     86.81      83.48    86.09    86.82       86.96      86.96
credit-g     75.10     76.70      74.20    74.60    76.90       73.70      75.90
glass        77.27     77.73      79.55    74.55    75.91       78.18      79.09
heart-c      81.53     83.17      81.83    77.85    82.80       83.80      83.45
heart-h      79.24     82.62      79.92    79.92    80.60       81.29      80.93
heart-s      90.26     93.53      89.43    90.26    91.93       92.69      91.93
heart-v      71.00     75.00      72.00    72.50    77.50       77.50      76.50
hepatitis    83.79     82.46      86.37    83.83    85.67       85.04      85.04
horse-colic  80.96     83.96      84.50    85.86    85.86       84.23      85.59
hypo         98.74     98.80      98.99    99.15    99.02       98.92      98.99
ion          95.28     93.89      93.61    93.61    93.61       93.89      93.89
iris         94.67     94.00      94.67    94.67    94.67       94.00      94.00
krkp         99.56     95.75      98.72    99.66    99.47       97.94      99.13
labor        85.67     84.00      89.00    84.00    89.33       94.33      94.33
led-24       71.43     69.44      72.41    73.57    74.93       74.27      74.77
letter       96.74     97.03      96.44    94.90    96.84       95.66      96.81
lymph        82.67     80.00      86.67    78.67    84.00       84.00      85.33
page         96.39     97.21      97.34    97.19    97.32       97.21      97.39
pendigits    99.21     99.30      99.25    98.59    99.25       99.02      99.14
phoneme      91.46     83.70      90.37    91.42    91.26       91.02      91.35
pima         74.29     74.55      76.49    76.75    76.75       75.45      75.97
primary      35.73     45.75      44.57    40.74    45.16       44.56      46.35
sat          91.89     92.19      92.24    91.06    92.08       91.26      91.72
sick         98.91     96.29      98.86    98.94    98.49       97.96      98.17
sonar        81.43     82.38      85.24    77.14    81.90       82.38      83.81
soybean      91.36     95.31      92.38    92.53    94.14       94.00      93.41
vehicle      76.94     75.76      74.59    74.35    74.59       73.53      74.47
vote         95.23     95.45      94.77    95.91    95.91       95.00      96.14
waveform     84.21     85.27      85.55    84.01    85.01       85.59      85.41
Borda count  108       144        134      114      167         131        166
Summary
Better       2         3          3        -        4           3          4
Worse        1         4          1        -        2           2          2
Same         31        27         30       -        28          29         28
Score        17.5      16.5       18       -        18          17.5       18

5 Discussion

5.1 Random Forests and Bagging 

Since the random forest approach uses bagging to create the training sets for
the trees of its ensembles, one might expect that the two algorithms share the
same wins and losses while the other methods do not. This turns out not to be
the case. On the krkp data set, random forests are statistically significantly less 
accurate than bagging, as are random subspaces and random trees. Likewise in 
the three cases where each version of random forests is statistically significantly 
more accurate than bagging, random subspaces and random trees are also more 
accurate; on two of those data sets boosting is more accurate. An interesting 
experiment would be to measure the accuracy of random forests without bagging 
the training set since this could lead to a decrease in the running time. Random 
forests are already much faster than bagging since fewer attributes need to be 
tested at every possible node in the tree. 

5.2 Comparison against Prior Results in the Literature 

Our accuracy results are comparable to those published by Breiman in [5] for
both boosting and random forests. Of the 12 data sets common to each work, our
implementation of random forests-1 is more accurate six times and our
implementation of boosting is more accurate seven times than what is shown in [5].
The accuracy differences on data sets are small, possibly influenced by the use
of two different splitting criteria functions (the gini index for CART and
information gain ratio for OpenDT).

In Dietterich’s Randomized C4.5 experiments, he chose to report the best 
of the pruned (C4.5 certainty factor of 10) and unpruned ensembles, arguing 
that the decision to prune might always be correctly determined by doing cross 
validation on the training set. Of the 13 data sets for which Dietterich chose to 
use unpruned trees and which appear in our paper, the OpenDT implementation 
was more accurate ten times. This difference is likely a result of the OpenDT im- 
plementation splitting randomly amongst the top twenty best attributes, rather 
than among attribute tests. 

In [19], Ho carried out extensive comparison experiments between random 
subspaces and bagging on two class data sets that have more than 500 examples. 
The classifier utilized was an oblique decision tree. The data sets were randomly 
split in half 10 times and 10 experiments building 100 trees for the ensemble 
were done to get different accuracy values for statistical comparison purposes. 
Ho characterized the data sets for which random subspaces was statistically 
significantly better and the data sets for which bagging was better using a set 
of metrics. The average ratio of examples to features was about 93 for the data 
sets for which random subspaces outperform bagging. For the data sets in which 
we found significant differences, the minimum ratio of examples to features was 
88 and all others were above 125. However, for some of these data sets bagging 
was better than any of the other approaches. Evaluating these data sets on the 
rest of the metrics used would be interesting future work (with the caveat that
some of these data sets are multi-class).

Another result of note from this work is that significant differences between 
ensemble classifiers were only found for data sets with more than 3000 examples. 
Previously [10], it was found that building an ensemble of classifiers from disjoint 
subsets of the data could be as accurate or more accurate than bagging for large 
data sets. One might consider that classifiers would be more diverse [20] if built 
from disjoint subsets of data than built with bagging if the data was dense. A 
diverse set of classifiers can, sometimes, have a beneficial effect on accuracy. 



5.3 Other Ensemble Methods vs. Bagging 

Given previous results [7, 5] where random forests and random subspaces were 
better than boosting on data sets without noise and often better than bagging, 
we expected that one or more approaches might very often be better in a very
rigorous statistical test. This was not the case (even at the 95% level, where
RF-1 does somewhat better but there are no changes in other rankings vs.
bagging). Of the 34 data sets
examined, the maximum statistically significant wins was 4 (with 2 losses). While 
the Borda count shows that, given several data sets, techniques such as random 
forests can show an accuracy improvement over bagging, for any particular data 
set, this accuracy improvement is not reliable. 

There are other potential benefits aside from increased accuracy, though.
Random forests, by picking only a small number of attributes to test, generate
trees very rapidly. Random subspaces, which also test fewer attributes, can also
use much less memory because only the chosen percentage of
attributes needs to be stored. Recall that since random forests may potentially 
split on any attribute, it must store all the data. Since random trees do not 
need to make and store new training sets, they save a small amount of time and 
memory over the other methods. 

Boosting has the unique ability to specialize itself on the hard to learn ex- 
amples. Unfortunately this makes the algorithm highly susceptible to noise. As 
several of the data sets used here are known to be noisy, boosting is at a disad- 
vantage in these experiments. If led-24, a synthetic data set with artificial noise, 
is removed from the experiment, boosting would have zero losses at the 99% 
confidence interval. 

Finally, random trees and random forests can only be directly used to create 
ensembles of decision trees. As with bagging, both boosting and random sub- 
spaces can be utilized with other learning algorithms such as neural networks. 



6 Summary 



This paper compares several methods of building ensembles of decision trees. A 
variant of the randomized C4.5 method introduced by Dietterich [6] (which we 
call random trees B), random subspaces [7], random forests [5], Adaboost.M1W,
and bagging are compared. All experiments used a 10-fold cross validation
approach to compare average accuracy. The accuracy of the various ensemble
building approaches was compared with bagging using OpenDT to build unpruned
trees. The comparison was done on 34 data sets with 32 taken from the UC 
Irvine repository [11] and the others publicly available. 

The ensemble size was 1000 trees for each of the ensemble creation
techniques except boosting, which used 50. The ensemble size of boosting was
chosen to match what had been used in previous work [5]. The ensemble size of the
remaining techniques was chosen to enable ensembles to reach maximum accuracy.
Statistical significance tests in conjunction with ANOVA were done to determine
whether each of the ensemble methods was statistically significantly more or
less accurate than bagging.

No approach was unambiguously always more accurate than bagging. Ran- 
dom forests generally have better performance, and are much faster to build. 
The accuracy of the random subspace approach fluctuated, showing mediocre 
results statistically, but fairly good results generally. It is notable that a random 
forest built utilizing only two randomly chosen attributes for each test in the 
decision tree was among the most accurate classification methods. 



Abstract. Atypical observations, called outliers, are one of the difficulties
in applying standard Gaussian-density-based pattern classification methods.
A large number of outliers makes the distribution densities of the input
features multimodal. The problem becomes especially challenging in
high-dimensional feature spaces. To tackle atypical observations, we propose
multiple classifier systems (MCSs) whose base classifiers use different
representations of the original feature vector, obtained by transformations.
This makes it possible to deal with outliers in different ways. As the base
classifier, we employ the integrated approach of statistical and neural
networks, which consists of data whitening and training of a single layer
perceptron (SLP). Data whitening makes the marginal distributions close to
unimodal, and the SLP is robust to outliers. Various combination strategies
for the base classifiers reduced the generalization error in comparison
with the benchmark method, regularized discriminant analysis (RDA).



1 Introduction 

In many real-world pattern recognition tasks, including printed and handwritten
character recognition, we often meet atypical observations and have to
classify them with Gaussian classifiers. Outliers are observations that follow
another distribution. If the number of outliers is large, the distributions can
be multimodal. Applying a Gaussian model to a multimodal distribution produces
many outliers.

To deal with multimodal distributions, nonparametric (local) pattern recognition
methods such as the k-NN rule and the Parzen window classifier could be used,
because they approach the Bayes classifier for large training samples. However,
in high-dimensional, small-sample cases, the sample size/complexity ratio
becomes low. In such situations, the use of nonparametric methods is
problematic [1-3].

To reduce the influence of atypical observations, we suggest multiple classifier
systems (MCSs) whose base classifiers use different representations of the
original feature vector. We perform different transformations of the original
feature vector (including no transformation) in order to deal with outliers in
different ways.

As the base classifier, we employ the integrated approach of statistical and neural
networks. This approach combines data whitening with training of a single
layer perceptron (SLP) to recognize patterns. In data whitening, we also perform
data rotation to achieve a good start, speed up the SLP training, and obtain new
features whose marginal distribution densities are close to unimodal and often
resemble the Gaussian distribution. In one base classifier, we used robust
estimates of the mean vectors and the pooled covariance matrix for data rotation.
The SLP-based classifier is inherently robust to outliers.

We considered various kinds of combination strategies of the base classifiers in- 
cluding linear and non-linear fusion rules. We compare their performances with the 
regularized discriminant analysis (RDA) [2-4] as a benchmark method. RDA is one of 
the most powerful statistical pattern classification methods. 

To test our theoretical suggestions, we considered the important task of recognizing
handwritten Japanese characters. In handwritten Japanese character recognition, some
of the classes can be easily discriminated. However, there are many very similar
classes, and recognition of such similar classes is an important but difficult problem.
To improve this situation, we study the most ambiguous pairs of pattern
classes. For illustration, eight pairs of similar Japanese characters are shown in Fig. 1.





Fig. 1. Eight pairs of similar Japanese characters.



2 Sample Size/Complexity Properties 



The standard Fisher discriminant function (we abbreviate “discriminant function” as
DF) is one of the most popular decision rules. Let $\bar{x}^{(1)}$ and $\bar{x}^{(2)}$ be the
sample mean vectors, and $S$ be the pooled sample covariance matrix. Allocation of a
p-variate vector $x = (x_1, \ldots, x_p)^T$ is performed according to the sign of the DF [2]

$$g(x) = \left(x - \tfrac{1}{2}\left(\bar{x}^{(1)} + \bar{x}^{(2)}\right)\right)^T S^{-1} \left(\bar{x}^{(1)} - \bar{x}^{(2)}\right). \qquad (1)$$

Let $N$ be the number of training vectors of each class used to obtain the estimates
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$ and $S$. The asymptotic classification error of the sample-based DF is
given as $P_B = \Phi(-\delta/2)$, where $\delta$ stands for the Mahalanobis distance and
$\Phi(\cdot)$ is the cumulative distribution function of $N(0,1)$ (see, e.g., [2]). As both
the sample size $N$ and the dimensionality $p$ increase, the distribution of the
sample-based DF approaches the Gaussian law. After calculating the conditional means
and the common variance of discriminant function (1), one can find the expected
probability of misclassification



$$EP_N = \Phi\left\{ -\frac{\delta}{2} \left[ \left(1 + \frac{2p}{N\delta^2}\right) \frac{2N}{2N-p} \right]^{-1/2} \right\} \qquad (2)$$

[3, 5]. In Eq. (2), the term $2N/(2N-p)$ arises due to inexact estimation of the
covariance matrix, and the term $1 + 2p/(N\delta^2)$ arises due to inexact estimation of
the mean vectors. Equation (1) becomes the Euclidean distance classifier (EDC) if the
covariance matrix $S$ is ignored. The EDC has relatively good small sample properties.
Similarly, if the covariance matrix $S$ is described by a small number of parameters,
a DF with better small sample properties can be obtained. An example is the first
order tree dependence model described in Sect. 4 (see, e.g., [3, 6]). An alternative
way to improve small sample properties is RDA. The covariance matrix of RDA is given
as $S_{RDA} = (1-\lambda)S + \lambda I$, where $\lambda$ is a constant defined in the interval
[0, 1]. The optimal value of $\lambda$, denoted by $\lambda_{opt}$, has to be chosen by taking
into account the balance between the complexity of the pattern recognition task
(the structure of the true covariance matrix $\Sigma$) and the sample size $N$.

In our investigations, the sample size is N = 100 and the original dimensionality is
p = 196. Suppose the Bayes error is $P_B = 0.1$ ($\delta = 2.56$). Then
$1 + 2p/(N\delta^2) \approx 1.6$ and $2N/(2N-p) = 50$. High values of these coefficients
indicate that we work with a serious deficit of training data. One way to alleviate
the data deficit is to reduce the dimensionality, i.e. to perform feature selection.
Another way is to use a simpler estimate of the covariance matrix. We use both; they
are described in Sect. 3 and 4.
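
As a quick check of the arithmetic above, the two factors of Eq. (2) can be evaluated directly. The sketch below is only illustrative: it assumes that Eq. (2) has the form $\Phi\{-(\delta/2)/\sqrt{T_\mu T_\Sigma}\}$, with $T_\mu = 1 + 2p/(N\delta^2)$ and $T_\Sigma = 2N/(2N-p)$, and the function names are hypothetical.

```python
import math

def mean_term(p, n, delta):
    """Factor of Eq. (2) caused by inexact estimation of the mean vectors."""
    return 1.0 + 2.0 * p / (n * delta ** 2)

def cov_term(p, n):
    """Factor of Eq. (2) caused by inexact estimation of the covariance matrix."""
    return 2.0 * n / (2.0 * n - p)

def std_normal_cdf(z):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_error(p, n, delta):
    """Expected misclassification probability under the assumed form of Eq. (2)."""
    shrink = math.sqrt(mean_term(p, n, delta) * cov_term(p, n))
    return std_normal_cdf(-0.5 * delta / shrink)

N, DELTA = 100, 2.56
for p in (196, 20):  # original and reduced dimensionality
    print(p, mean_term(p, N, DELTA), cov_term(p, N), expected_error(p, N, DELTA))
```

Both factors shrink sharply when the dimensionality drops from 196 to 20, which is the motivation for the feature selection of Sect. 3.3.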



3 Representations of Feature Vectors

3.1 The Original Feature Vector 

In this paper, the 196-dimensional directional element feature [7] was used to represent
handwritten Japanese characters in the ETL9B database [8]. Before extracting the
feature vector, a character image was normalized nonlinearly [9] to fit in a 64x64 box.
Then the skeleton was extracted, and line segments in the vertical, horizontal and
slanted (±45 degrees) directions were extracted. The image is divided into 49 sub-areas
of 16x16 dots. The sum of the segments of each direction in a sub-area is an element
of the feature vector.



3.2 Three Representations of Feature Vector 

In constructing the three base classifiers for the multiple classifier system, we used
three kinds of feature vector:

A) original (without transformation),

B) transformed by $y_r = x_r^{1/s}$ for the r-th element of the feature vector
(r = 1, ..., 196; s is arbitrary), and

C) binarized (0 or 1): non-zero valued components of the feature vector were
set to 1.
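
The three representations can be sketched in a few lines of Python. This is a minimal illustration, assuming that transformation B is the element-wise power $x^{1/s}$ with s = 2 (the value reported in Sect. 5.2) and that all feature values are non-negative; the function name `represent` is hypothetical.

```python
def represent(x, s=2.0):
    """Return the three representations (A, B, C) of one feature vector x."""
    original = list(x)                          # A: no transformation
    rooted = [v ** (1.0 / s) for v in x]        # B: y_r = x_r^(1/s), assumed form
    binary = [1 if v != 0 else 0 for v in x]    # C: non-zero components set to 1
    return original, rooted, binary

a, b, c = represent([0.0, 4.0, 9.0, 0.0, 16.0])
print(a, b, c)
```

Zero-valued components stay at zero in all three representations, so B compresses the spread of the non-zero values while C keeps only the presence of a stroke.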

We comment on the reasons for using features B and C. Fig. 2a shows a histogram
of an element of the feature vector that corresponds to a sub-area at the boundary
of an image. The distribution density is highly asymmetric. It is well known that
estimation of the covariance matrix requires Gaussian density [10], and nonlinear
transformations such as transformation (B) often help to reveal the correlation
structure of the data. The histogram of the same element after the nonlinear
transformation $y_r = x_r^{1/s}$ is shown in Fig. 2b. We notice that the distribution of
a single feature is clearly bimodal, with one peak at zero. One possible way to tackle
the bimodality problem is to ignore “outliers” (in our case, zero valued features).
As mentioned in Sect. 3.3, our dimensionality reduction strategy has a similar effect
for feature B. However, the deletion may cause loss of information. Therefore, the
third expert classifier uses binary vectors.




(a) Original measurement. (b) After transformation by $y_5 = x_5^{1/s}$.

Fig. 2. Histograms of the distribution of the 5th feature, $x_5$.



3.3 Dimensionality Reduction 

When the number of training vectors is unlimited, local pattern recognition
algorithms (such as the Parzen window classifier and the k-NN rule) can attain the
minimal (Bayes) classification error if properly used. Unfortunately, in practice the
number of training vectors is limited, and the dimensionality of the feature vector is
usually high. Therefore, one needs to use the available prior information to build the
classification rule. An optimal balance between complexity and training sample size
has to be maintained. If the sample size is not very large, one has to restrict the
complexity of the base classifiers. In the small sample case, optimistically biased
resubstitution error estimates of the base classifiers supplied to the fusion rule
designer could ruin the performance of the MCS [11].

With reduced dimensionality, p* = 20, the coefficient in Eq. (2) is notably smaller:
$1 + 2p^*/(N\delta^2) \approx 1.06$. In order to improve the small sample properties of
the base classifiers, for each kind of feature transformation we selected only twenty
“better” features. Selection was performed on the basis of the sample Mahalanobis
distance of each original feature. Since the sample size was relatively small, we could
not use a complex feature selection strategy. Here, the written character occupies only
a part of the 7x7 area. As in many similar character recognition problems, some of the
sub-areas are almost “empty”. Thus, our feature selection strategy is: first, the r-th
elements of the 196-dimensional feature vectors were divided by their class standard
deviation $s_r$ (r = 1, 2, ..., 196); then, the twenty features (dimensions) whose
1-dimensional sample means of the two classes are most distant were selected. The
experiments showed that this feature selection also had an important secondary effect:
many non-informative features which contain a large number of outliers (zero valued
measurements) were discarded.
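
The selection rule described above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that the scaling uses the pooled within-class standard deviation, and the names `select_features` and `n_keep` are hypothetical.

```python
import statistics

def select_features(class1, class2, n_keep=20):
    """Indices of the n_keep features with the most distant standardised class means."""
    n_features = len(class1[0])
    scores = []
    for r in range(n_features):
        col1 = [x[r] for x in class1]
        col2 = [x[r] for x in class2]
        # Assumption: pooled within-class variance is used for the scaling.
        var = (statistics.pvariance(col1) + statistics.pvariance(col2)) / 2.0
        sd = var ** 0.5
        if sd == 0.0:
            scores.append(0.0)  # constant (e.g. always-zero) feature is ranked last
            continue
        scores.append(abs(statistics.fmean(col1) - statistics.fmean(col2)) / sd)
    order = sorted(range(n_features), key=lambda r: scores[r], reverse=True)
    return sorted(order[:n_keep])

# Toy data: feature 0 is "empty", feature 2 separates the classes best.
toy1 = [[0.0, 1.0, 5.0], [0.0, 2.0, 6.0], [0.0, 3.0, 7.0]]
toy2 = [[0.0, 1.5, 15.0], [0.0, 2.5, 16.0], [0.0, 3.5, 17.0]]
print(select_features(toy1, toy2, n_keep=2))
```

A one-dimensional criterion like this ignores correlations between features, which keeps the number of estimated quantities small when only 100 vectors per class are available.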



4 Three Base Classifiers

Three base classifiers were built in the three feature spaces (A, B, C) respectively. To 
construct robust base classifiers, we utilized the integrated approach of statistical and 
neural network [3]. This approach consists of data whitening and training of SLP. 
This can offer better linear DF by taking advantage of both statistical methods and 
neural networks. 

In the data whitening, one moves the data mean to the origin of coordinates,
and then performs the data whitening transformation

$$y = \Lambda^{-1/2} \Phi^T \left(x - \tfrac{1}{2}\left(\bar{x}^{(1)} + \bar{x}^{(2)}\right)\right),$$

where $\Lambda$ and $\Phi$ are the matrices of eigenvalues and eigenvectors of the
simplified covariance matrix $S_{RDA\&Tree1}$. We used regularization and the first order
tree dependence model [6] for the simplified covariance matrix,
$S_{RDA\&Tree1} = S_{Tree1}(1 - \lambda_{opt}) + \lambda_{opt} I$, where $S_{Tree1}$ is the covariance
matrix of the first order tree dependence model, described by only 2p-1 independent
parameters. This simplification makes the estimate of the covariance matrix more
reliable in the small sample case. Thus, in data whitening, we perform data rotation by
means of the orthogonal matrix $\Phi$ and variance normalization by multiplying the rotated
data by the matrix $\Lambda^{-1/2}$. This transformation has a secondary effect which has not
been discussed in the robust statistics literature. Linear transformation of
multidimensional data produces weighted sums of the original features (see Fig. 3). For
this reason, the distribution densities of the new features are closer to unimodal, and
often resemble the Gaussian distribution. In our experiments we repeatedly noticed that,
in the whitened feature space, the first components give good separation of the data.
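
The rotation and variance normalization can be illustrated with NumPy. The sketch below is a simplification made for illustration: it replaces the tree-structured estimate by the plain regularized pooled covariance $(1-\lambda)S + \lambda I$, and the function names are hypothetical.

```python
import numpy as np

def whitening_matrix(cov, lam=0.2):
    """Return T = Lambda^(-1/2) Phi^T for the regularised covariance matrix."""
    p = cov.shape[0]
    # Assumption: plain regularisation stands in for the tree-structured estimate.
    cov_reg = (1.0 - lam) * cov + lam * np.eye(p)
    eigvals, eigvecs = np.linalg.eigh(cov_reg)    # symmetric eigendecomposition
    return np.diag(eigvals ** -0.5) @ eigvecs.T

def whiten(x1, x2, lam=0.2):
    """Centre two classes (rows are vectors) at their mid-point and whiten them."""
    centre = (x1.mean(axis=0) + x2.mean(axis=0)) / 2.0
    pooled = (np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False)) / 2.0
    t = whitening_matrix(pooled, lam)
    return (x1 - centre) @ t.T, (x2 - centre) @ t.T

rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
class1 = rng.normal(size=(200, 3)) @ mix
class2 = rng.normal(size=(200, 3)) @ mix + 1.0
y1, y2 = whiten(class1, class2, lam=0.0)
pooled_y = (np.cov(y1, rowvar=False) + np.cov(y2, rowvar=False)) / 2.0
print(np.round(pooled_y, 6))
```

With `lam=0.0` the pooled covariance of the whitened data is the identity matrix; a positive `lam` trades exact whitening for a more stable estimate when the sample is small.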

After data whitening, the SLP was trained in the space of y. The training of the SLP
started with a zero valued weight vector. After the first batch iteration, we obtain
DF (1) in which $S$ is replaced by $S_{RDA\&Tree1}$. If the assumptions about the structure of
the covariance matrix are correct, the estimate $S_{RDA\&Tree1}$ yields a fairly good DF with a
relatively small error rate and good small sample properties right from the very
beginning.

If the starting regularization parameter $\lambda_{opt}$ is suitable, proper stopping can
help to obtain a classifier of optimal complexity. To determine the optimal number of
batch iterations (epochs) for training the SLP, we used independent pseudo-validation
data sets generated by colored noise injection [12]. The pseudo-validation sets were
formed by adding many (say n) randomly generated zero mean vectors to each training
pattern vector. The details are as follows. For each vector $x_i$, its k nearest
neighbors $x_{i1}, x_{i2}, \ldots, x_{ik}$ are found in the same pattern class; then k lines
connecting $x_i$ and $x_{iq}$ (q = 1, 2, ..., k) are prepared; along the q-th line, one adds
a random variable that follows the Gaussian distribution
$N(0, (\sigma \|x_i - x_{iq}\|)^2)$; after adding the k components, a new artificial vector
is obtained. This procedure is repeated n times. Three parameters have to be defined to
realize the noise injection procedure. In our experiments, we used k = 2, n = 10,
and $\sigma = 1$. In fact, noise injection introduces additional informal information: it
declares implicitly that the space between nearest vectors of one pattern
class could be filled with vectors of the same category (for more details see [11, 12]).
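
The noise injection procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: neighbours are found by Euclidean distance, the function name `inject_colored_noise` is hypothetical, and the default parameters follow the values k = 2, n = 10 given in the text.

```python
import math
import random

def inject_colored_noise(vectors, k=2, n=10, sigma=1.0, seed=0):
    """Create n artificial vectors for every training vector of one pattern class."""
    rng = random.Random(seed)
    artificial = []
    for i, x in enumerate(vectors):
        # k nearest neighbours of x within the same class (x itself excluded)
        others = [v for j, v in enumerate(vectors) if j != i]
        neighbours = sorted(others, key=lambda v: math.dist(x, v))[:k]
        for _ in range(n):
            new = list(x)
            for nb in neighbours:
                # A step t ~ N(0, sigma^2) along the line x -> nb displaces the
                # vector with standard deviation sigma * ||x - nb||.
                t = rng.gauss(0.0, sigma)
                new = [a + t * (b - c) for a, b, c in zip(new, nb, x)]
            artificial.append(new)
    return artificial

points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
pseudo_validation = inject_colored_noise(points)
print(len(pseudo_validation))
```

Because every displacement lies along a line between two vectors of the same class, the artificial vectors stay inside the region already occupied by that class, which is what makes the noise "colored".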

(a) Original feature space. (b) Whitened feature space.

Fig. 3. Effect of whitening transformation; the two marker types (one of them “o”)
stand for feature vectors of the two classes respectively, which were transformed by
$y_r = x_r^{1/s}$.



The second expert, B, worked in the feature space transformed by $y_r = x_r^{1/s}$, where
the bimodality of the data was clearly visible. For robust estimation, the influence of
outliers was reduced purposefully: we ignored measurements with zero feature values
when estimating the mean vectors and the covariance matrix. While estimating the
mean value and variance of the j-th feature, we rejected zero valued training
observations. To estimate $\rho_{ij}$, the correlation coefficient between the i-th and j-th
elements of the feature vector, we used only training vectors with nonzero i-th and
j-th components.
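
A minimal sketch of these zero-rejecting estimates is given below. The function names are hypothetical, and the sketch assumes that at least two usable observations remain after the zero values are rejected.

```python
import statistics

def nonzero_mean_var(column):
    """Mean and variance of one feature, ignoring zero valued observations."""
    values = [v for v in column if v != 0]
    return statistics.fmean(values), statistics.pvariance(values)

def nonzero_correlation(col_i, col_j):
    """Correlation of two features over vectors where both components are nonzero."""
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != 0 and b != 0]
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

feature_i = [0, 1, 2, 3, 0]
feature_j = [5, 2, 4, 6, 0]
mean_i, var_i = nonzero_mean_var(feature_i)
rho_ij = nonzero_correlation(feature_i, feature_j)
print(mean_i, var_i, rho_ij)
```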



5 Experiments with Handwritten Japanese Characters 

5.1 Fusion Rules 

We utilize a number of different fusion rules and compare classification performances 
of MCSs with RDA used as a benchmark method. From an abundance of known fu- 
sion rules (see, e.g., [13]), eight linear and non-linear rules below were considered to 
make final decision. 

BestT) The best (single) base classifier is selected according to the classification
results on the test set. This is the ideal classifier, which achieves the minimum
error rate obtainable with the three base classifiers. This classifier and BestV (the
next item) are weighted voting MCSs in which only one weight is nonzero.

BestV) The best (single) base classifier is selected according to classification re- 
sults using the pseudo-validation set. This classifier was used as a benchmark MCS. 

MajV) Majority voting. This is a fixed (non-trainable) fusion rule. 

WSum) Weighted sum of the outputs of the base classifiers. An SLP was used as the
fusion rule. The original training data set was used to train the SLP classifier and
produce the coefficients of the weighted sum. Optimal stopping was performed according
to the classification error estimated from the pseudo-validation set.

BKS) The original behavior knowledge space method (see, e.g., [3, 14]). Allocation
is performed according to the probabilities of the eight combinations of binary
outputs of the three base classifiers; the training set was used to estimate these
probabilities.

BKSn) Modified BKS method aimed at reducing expert adaptation to the training data
[11]. An independent pseudo-validation set was used to estimate the probabilities.

G&R) Nonlinear classification in the 3D space of the three outputs of the base
classifiers. To make the final allocation, the Parzen window classifier was used (this
MCS utilizes the expert outputs; its decision making procedure resembles that of
Giacinto and Roli [15], and therefore it is marked G&R).

R&E) Nonlinear fusion of the outputs of the base classifiers, where a sample-based
oracle uses the input vector x to decide which expert is the most competent to
classify this particular vector. The fusion rule allocates vector x to one of three
virtual pattern classes (experts). The competence of the j-th expert is estimated as a
“potential”

$$P_j(x) = \sum_{s=1}^{K} \sum_{i=1}^{N_s} q_{ji}^{(s)} \exp\left\{ -h^{-1} \left(x - x_i^{(s)}\right)^T \left(x - x_i^{(s)}\right) \right\}, \quad (j = 1, 2, 3), \qquad (3)$$

where $q_{ji}^{(s)} = 1$ if the i-th training vector of the s-th class, $x_i^{(s)}$, was
classified by the j-th expert correctly, and $q_{ji}^{(s)} = -1$ if vector $x_i^{(s)}$ was
classified incorrectly; exp{·} is a kernel and h is a smoothing constant. This approach
corresponds to the Rastrigin and Erenstein [16] fusion rule introduced three decades
ago. We mark it by R&E.

RDA) RDA is one of the most powerful pattern classification tools, and was
described in Sect. 2.



5.2 Experiment 

Each pair of handwritten characters contains two similar classes. Each class consists
of 200 vectors. 100 vectors were randomly selected as the training set (N = 100), and
the remaining 100 vectors formed the test set. To reduce the influence of randomness,
the experiments were performed 100 times for each pair of characters. Each time, a
random permutation of the vectors was performed in each category.

For each data representation, individual feature selection and subsequent data
rotation and normalization procedures were performed. After a few preliminary
experiments, the following parameters were determined: p* = 20, $\lambda_{opt}$ = 0.2,
and s = 2.

In all experiments, only the training set and its “product”, the artificial
pseudo-validation set described in Sect. 4, were used to design the decision making
rules. The SLPs used as the base classifiers were trained on the training set. The
optimal number of iterations was determined by the recognition results on the
pseudo-validation set. While building some of the trainable fusion rules, we
interchanged the training and pseudo-validation sets: the fusion rules were trained on
the validation set, and the optimal number of iterations was found according to the
error rates on the training set. The test set was used only once, for the final
evaluation of the generalization errors.

Results obtained in 800 training sessions (100 independent experiments with eight
pairs of similar handwritten Japanese characters) are summarized in Table 1. For
every pair, the averaged test error rates of the three experts (1st E, 2nd E and 3rd E),
BestT and BestV are presented in the left five columns of Table 1 (next to the index of
the character pair). Let $P_{BestV}$ be the averaged test error rate of BestV (printed in
bold in the table) and $P_{Method A}$ that of “Method A”. The ratios of the averaged test
error rate of “Method A” (corresponding to the remaining six fusion rules and RDA) to
that of BestV are shown in the right seven columns of Table 1; that is, the relative
error rate is given as $P_{Method A} / P_{BestV}$. The very last row in the table contains the
averaged values of the eight cells of each column.

Table 1. Average error rates of the single experts, BestT and BestV are in the left five
columns next to the index, and the relative efficacy of six fusion procedures and RDA is
in the seven columns on the right. Relative error rates of the most effective fusion
rules are underlined.

PAIR  1st E  2nd E  3rd E  BestT  BestV  MVot   WSum   BKS    BKSn   G&R    R&E    RDA
A     0.037  0.037  0.042  0.035  0.042  0.882  0.899  1.022  0.930  1.239  0.911  1.349
B     0.129  0.116  0.198  0.112  0.122  0.946  1.012  1.039  0.946  1.090  1.008  1.385
C     0.056  0.070  0.159  0.054  0.066  1.002  0.964  0.923  1.002  0.955  0.841  1.115
D     0.138  0.139  0.154  0.132  0.144  0.950  0.981  1.051  1.018  1.042  0.972  1.453
E     0.087  0.083  0.133  0.079  0.103  0.805  0.810  0.865  0.823  0.819  0.842  1.261
F     0.135  0.130  0.190  0.125  0.138  0.920  0.962  0.996  0.921  0.986  0.972  1.575
G     0.086  0.088  0.100  0.081  0.091  0.940  0.940  0.948  0.941  0.955  0.968  1.675
H     0.120  0.119  0.149  0.112  0.125  0.899  0.923  0.960  0.899  0.943  0.967  1.410
ALL   0.098  0.098  0.141  0.091  0.104  0.918  0.936  0.976  0.935  1.004  0.935  1.403

In spite of apparent similarity of eight kinds of Japanese character pairs, we have 
notable variations in experimental results obtained for diverse pairs: both separability 
of pattern classes (classification error rate) and relative efficacy of the experts differ 
by pairs. 

Nevertheless, for all the pairs, RDA, whose parameter $\lambda$ is adjusted to the
complexity of the recognition problem and the size of the training set, was outperformed
by the MCSs designed to deal with the outlier and multimodality problems. Comparing RDA
with the MCS rules, the highest gain (1.675 times in comparison with BestV, and 1.78
times in comparison with Majority Voting) was obtained for pair G, where all three
experts were approximately equally qualified. The lowest gain (1.115 times in comparison
with BestV, and 1.325 times in comparison with the Rastrigin-Erenstein procedure) was
obtained for pair C, where the third expert was notably worse than the two others.
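
The gains quoted above follow from the relative error rates in Table 1: dividing the RDA column by the column of the best fusion rule gives the gain of that rule over RDA. A small check, with the values copied from rows G and C of the table:

```python
# Relative error rates (ratio to BestV) taken from Table 1.
rda_vs_bestv = {"G": 1.675, "C": 1.115}
best_fusion_vs_bestv = {"G": 0.940, "C": 0.841}   # MVot for pair G, R&E for pair C

gains = {pair: rda_vs_bestv[pair] / best_fusion_vs_bestv[pair]
         for pair in rda_vs_bestv}
for pair, gain in gains.items():
    print(pair, gain)
```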

The training set size of the current problem, 100+100 vectors in a 196-variate space,
is rather small. Therefore, sophisticated trainable fusion rules were ineffective: for
almost all eight Japanese character pairs considered, the fixed fusion rule, Majority
Voting, was the best. The exception is pair C because of the inefficiency of the third
expert; that is, only two experts participated in the final decision making. Detailed
analysis shows that, in general, all three experts are useful: rejection of any one of
them leads to an increase in the generalization error of the MCS.



6 Concluding Remarks

In this paper, we considered the problem of atypical observations in the training set
in high-dimensional situations where the sample size is relatively small. Since
Gaussian classifiers are not suitable for atypical observations, as a practical
solution we proposed multiple classifier systems (MCSs) whose base classifiers use
different data representations. The base classifiers are constructed with the
integrated approach of statistical and neural networks.

To test the proposed MCSs, we considered the task of recognizing similar pairs of
handwritten Japanese characters. For all eight similar Japanese character pairs
considered, all the proposed MCSs outperformed the benchmark classification method,
RDA, in the small sample, high-dimensional setting. The success of MCSs with base
classifiers working in differently transformed feature spaces indicates that nonlinear
transformations are important in revealing atypical observations. In dealing with the
outlier problem, the dissimilarity of the features allowed the MCSs to reduce the
generalization error.

With a simple feature selection procedure, all three base classifiers worked in
reduced feature spaces. The feature selection procedure utilizes additional
information: some of the features are notably less important for linear classification
rules designed to operate with unimodal distributions. Analysis of the histograms of
the rejected features showed that they often had bimodal distribution density
functions, i.e. a substantial part of the data consisted of zero-valued measurements.
This means that our feature selection alleviated the outlier and multimodality
problems.

The training sample size used to train the experts and the trainable fusion rules of
the MCSs is too small for the given high-dimensional pattern recognition problem.
Therefore, the fixed fusion rule performed the best. In situations with a larger
number of samples, more sophisticated fusion rules would be preferable and could
lead to a higher gain in dealing with the outlier and multimodality problems.



Abstract. Demand for solving complex problems has directed the research
trend in intelligent systems toward the design of cooperative multi-experts.
One way of achieving effective cooperation is through sharing
resources such as information and components. In this paper, we study
classifier combination techniques from a cooperation perspective. The degree
and method by which multiple classifier systems share training resources
can be a measure of cooperation. Even though data modification
techniques, such as bagging and k-fold cross-validation, have been extensively
used, there is no guidance on whether sharing or not sharing training
patterns results in higher accuracy, and under what conditions. We carried
out a set of experiments to examine the effect of sharing training
patterns on several architectures by varying the size of the overlap between
0 and 100% of the size of the training subsets. The overall conclusion is that
sharing training patterns among classifiers is beneficial.



1 Introduction 

Combination of multiple classifiers has been studied and compared from different 
perspectives. These categorizations are based on different design techniques ([1], 
[5], [9]), type of aggregation strategies ([6], [7], [13]), and topology of different 
architectures [11]. However, none of these studies have considered combination 
methods from cooperation viewpoint. Although Sharkey [13] differentiates be- 
tween cooperative and competitive approaches, this distinction refers to the way 
in which outputs of base classifiers are combined to make the final decision. In 
this study, we are interested in the issue of cooperation from a different perspec- 
tive. The degree and method by which multiple classifier systems (MCS) share 
the resources can be a measure of cooperation. A clear picture of the behaviour 
of MCS may emerge by identification of the components/resources and analysis 
of the gains and drawbacks of sharing these resources. 

Sharing can be studied at four fundamental levels of MCS: training, feature 
representation, classifier architectures, and decision making. The basic level of 
sharing occurs at the training level. Sharing at this level may be examined by 
distinguishing three types of resources: training patterns, training algorithms, 
and training strategies. The key investigation, in this study, is how cooperation 
among classifiers through sharing training patterns affects system performance.
More training data may result in higher accuracy for individual classifiers.
However, many popular techniques in MCS have been designed based on partitioning
of the training data. There are several reasons behind these approaches: i) to
achieve diversity among classifiers ([8], [14], [15]), ii) to be able to work with
large datasets [4], iii) to obtain classifiers that are experts on specific parts of
the data space.

Bagging, boosting, and k-fold cross-validation are the most well-known data
modification approaches. In bagging [3], at each run, each classifier is presented
with a training set that is of the same size as the original training data. These
training sets consist of samples of training examples that are selected randomly
with replacement from the original training set. Each classifier is trained with
one of these training sets, and the final classification decision is made by
taking the majority vote over the class predictions produced by these classifiers.
Boosting, proposed by Freund and Schapire [12], is a technique for combining
weak classifiers. In this technique, classifiers and training sets are obtained in
a more deterministic way than in bagging. At each step of boosting,
weights are assigned to data patterns in such a way that higher weight is given
to the training examples that are misclassified by the classifiers. At the final
step, the outputs of the classifiers are combined using the weighted majority method
for each presented pattern. In k-fold cross-validation, the training set is randomly
divided into k subsets. Then, k-1 of these sets are used to train a classifier.
Subsequently, the resulting classifier is tested on the subset that has been left out
of the training. By changing the left-out subset in the training process, k classifiers
are constructed.
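
The two resampling schemes that involve no weighting, bagging and k-fold cross-validation, can be sketched with the standard library alone. The function names are hypothetical, and the sketch only builds the training sets; training and combining the classifiers is left out.

```python
import random

def bootstrap_sample(data, rng):
    """Bagging: draw len(data) examples with replacement from the original set."""
    return [rng.choice(data) for _ in data]

def k_fold_training_sets(data, k, rng):
    """k-fold cross-validation: return k (training set, left-out subset) pairs."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        splits.append((train, folds[held_out]))
    return splits

rng = random.Random(0)
patterns = list(range(10))
bag = bootstrap_sample(patterns, rng)
splits = k_fold_training_sets(patterns, 5, rng)
print(bag)
print(splits[0])
```

In terms of sharing, any two of the k training sets have k-2 of the k subsets in common, while two bootstrap samples overlap only by chance.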

Despite the growing number of publications on different data modification 
techniques and their usage, it is not clear what the advantages and disadvantages 
of sharing training patterns are, and which combining architectures make use 
of sharing data and which ones do not. In this work, we examine the effect 
of sharing patterns by developing simple partitioning schemes for the training 
data similar to Chawla and co-workers [4] . Since training in MCS is affected by 
different factors including type of architectures, we also studied the effects of 
sharing training data on several architectures. Throughout the experiments, for 
each different partitioning technique, we evaluated performance of the system 
using disjoint, different sizes of overlaps, and identical training sets. We were 
interested in finding a relationship between the overlap size and performance of 
several MCS architectures. 



2 Sharing Training Patterns 

Let $c = \{c_1, c_2, \ldots, c_l\}$ and $r = \{r_1, r_2, \ldots, r_k\}$ be sets of system
components and resources respectively. Let $f$ represent a mapping between a component
and a resource (a resource used by a component). Then, $S_{r_j}$ is the total number of
components $c_i \in c$ that share a resource $r_j \in r$:

$$S_{r_j} = \bigcup_{i=1}^{l} f(c_i, r_j) \qquad (1)$$

As a result, the total amount of system resources $r_j \in r$ shared by components
$c_i \in c$ can be denoted by

$$S_r = \bigcup_{j=1}^{k} \bigcup_{i=1}^{l} f(c_i, r_j) \qquad (2)$$


In this paper, we investigate the effect of sharing training patterns among
classifiers. Therefore, we consider the components of the system to be the classifiers
and the resources to be the training patterns. We denote a set of training data by
$X = (X_1, X_2, \ldots, X_n)$, where each training pattern $X_i$ is an m-dimensional vector.
The training data is randomly divided into disjoint and overlapped partitions.
The modified training set consists of $X^N = (X_1^N, X_2^N, \ldots, X_p^N)$, where p, p < n,
is the size of a partition randomly selected from the n patterns and N represents the
number of partitions. The partitioning methods used in this study are as follows:

Disjoint Partitions (DP): Training data is randomly partitioned into N disjoint
partitions, each of size (1/N)th of the original data. The union of the N training
sets is identical to the original data (Figure 1).

Disjoint Partitions with Replications (DPR): For each disjoint partition, 
elements of each partition are independently selected in a random fashion and 
added to the partition until the size of each partition is equal to the size of the 
original data (Figure 1). Importantly, there is no overlap between partitions. 

Overlapping Partitions (OP): A certain percentage of patterns are randomly 
selected from the disjoint partitions. The union of these randomly selected pat- 
terns constitutes the overlapping set. The overlapping set is added to each of the 
N partitions independently (Figure 1). 







Fig. 1. Partitioning Techniques, (a) Original Data, (b) DP, (c) DPR, (d) OP, (e) OPR, 
(f) FOP 
Overlapping Partitions with Replications (OPR): For each overlapping
partitioning set, elements of each partition are randomly replicated until the size
of each partition is equal to the size of the original data (Figure 1). The size of
the overlap varies depending on the randomly selected elements that are added to
each partition.

Fixed Overlapping Partitions (FOP): This method is similar to OP, except that
the overlapping set replaces a set of randomly selected patterns in each of the
disjoint partitions (Figure 1). In this technique, the union of the partitions is
not identical to the original data.

Task-Oriented Partitions (TOP): Training data is partitioned into k or fewer
disjoint partitions, where k is the number of classes. Each class, or a group of
classes, belongs to one partition. This method is only used for training modular
architectures.



Algorithm 1 The Partitioning Algorithm 

1: Partition each class, or several classes, to disjoint sets (TOP). 

2: Randomly partition the original training data into disjoint subsets (DP). 

3: Randomly replicate patterns of each disjoint set (DPR). 

4: For each overlap size, make an overlapping set by randomly selecting patterns from
the disjoint partitions generated in step 2.

5: Add the overlapping sets to the disjoint sets (OP). 

6: Randomly replicate patterns of the partitions generated in step 5 (OPR). 

7: Substitute overlapping sets in each of the disjoint sets by randomly eliminating 
some of the patterns from the disjoint sets (FOP). 
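
Steps 2 to 7 of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: patterns are represented by integer identifiers, the overlapping set is drawn from the pooled disjoint partitions, and all function names are hypothetical. Step 1 (TOP) is omitted because it depends on class labels.

```python
import random

def disjoint_partitions(data, n_parts, rng):
    """DP: shuffle the data and split it into n_parts disjoint subsets."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]

def replicate(partition, target_size, rng):
    """DPR/OPR: add random copies of a partition's own patterns up to target_size."""
    extra = [rng.choice(partition) for _ in range(target_size - len(partition))]
    return partition + extra

def overlapping_set(parts, overlap_frac, rng):
    """Step 4: draw the shared set (a fraction of the partition size) from the partitions."""
    size = int(overlap_frac * min(len(p) for p in parts))
    pool = [x for part in parts for x in part]
    return rng.sample(pool, size)

def overlapping_partitions(parts, overlap):
    """OP: add the same overlapping set to every disjoint partition."""
    return [part + overlap for part in parts]

def fixed_overlapping_partitions(parts, overlap, rng):
    """FOP: drop len(overlap) random patterns from each partition, then add the overlap."""
    return [rng.sample(part, len(part) - len(overlap)) + overlap for part in parts]

rng = random.Random(0)                      # fixed seed, for repeatability only
data = list(range(60))                      # 60 hypothetical pattern identifiers
dp = disjoint_partitions(data, 3, rng)
dpr = [replicate(p, len(data), rng) for p in dp]
shared = overlapping_set(dp, 0.5, rng)
op = overlapping_partitions(dp, shared)
opr = [replicate(p, len(data), rng) for p in op]
fop = fixed_overlapping_partitions(dp, shared, rng)
```

Note that in this sketch a partition may receive a second copy of one of its own patterns when the overlapping set is added; whether the original experiments removed such duplicates is not stated.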



The training data partitions were prepared according to the generation scheme in
Algorithm 1. Each of these partitioning schemes was developed for a different
objective: i) DP, DPR and TOP to obtain disjoint partitions, ii) DPR and OPR to
obtain partitions equal in size to the original training data, and iii) OP, OPR,
and FOP to obtain partitions with overlap. Because few studies have examined
MCS from the sharing point of view, it was not clear which data representation
(shared or disjoint) yields better generalization performance.

3 Description of Architectures 

The significance of sharing patterns may depend on the architecture of the MCS.
In order to draw reliable conclusions, we implemented four architectures (modular,
ensemble, stacked generalization, and hierarchical mixtures of experts) and
examined the effect of the partitioning techniques and overlap size on each of
them. The architectures are described as follows:

- Ensemble: classifiers were trained in parallel using the subsets obtained by
  the DP, DPR, OP, OPR and FOP partitioning methods. The final decision was
  made using the Majority Vote and Weighted Averaging aggregation methods.

- Modular: complete decoupling [1] and top-down competitive [13] architectures
  were implemented. For each of these architectures, the Task-Oriented
  Partitioning (TOP) method was used, in which each class, or a group of
  disjoint classes, was assigned to one training set. By adding an overlapping
  set (OP) to these disjoint sets, the effect of sharing was evaluated. In the
  top-down competitive method, control switching to the appropriate module was
  done using a backpropagation classifier.

- Stacked Generalization: classifiers were arranged in layers [16]. As in the
  ensemble architecture, the DP, DPR, OP, OPR and FOP methods were used for
  partitioning. The final decision was made using a backpropagation network
  located at the higher level.

- Hierarchical Mixtures of Experts (HME): classifiers were arranged in layers
  [10]. The DP, DPR, OP, OPR and FOP methods were used to obtain disjoint
  and overlapped training subsets. Backpropagation classifiers were used at
  the gating level.
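
The two aggregation rules named for the ensemble architecture can be illustrated with a short Python sketch. The function names, the tie-breaking rule and the choice of classifier weights are assumptions made for illustration; the paper does not specify how the weights for Weighted Averaging were obtained.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by most classifiers (ties: first label seen)."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(scores, weights):
    """Combine per-classifier class scores with classifier weights.

    scores[n][c] is the score classifier n gives to class c.
    Returns the winning class index and the combined score list.
    """
    total = sum(weights)
    n_classes = len(scores[0])
    combined = [sum(w * s[c] for w, s in zip(weights, scores)) / total
                for c in range(n_classes)]
    winner = max(range(n_classes), key=combined.__getitem__)
    return winner, combined

vote = majority_vote([1, 0, 1])
winner, combined = weighted_average(
    [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]],   # hypothetical classifier outputs
    [1.0, 1.0, 2.0],                        # hypothetical classifier weights
)
```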

4 Experimental Setup 

We used two benchmark datasets for this study. The 80-D Correlated Gaussian
is an 80-dimensional artificial dataset [8]. It consists of two Gaussian classes with
equal covariance matrices and 500 vectors for each class. The second dataset,
20-class Gaussian, is a two-dimensional artificial dataset that contains 20 classes
[2]. Each class has 100 patterns.

We studied different sizes of overlaps. The size of the overlapping set varied
from 0 to 100 percent of the size of the partitions, which resulted in 0-50% of
shared data patterns among classifiers. This ratio can be calculated using Eq. 2
as $S_p / (p + S_p)$, where $p$ is the size of the disjoint partitions and $S_p$ is
the size of the overlapping set. We were also interested in studying the influence
of the sample size on MCS in the presence of shared and disjoint training data. We
considered three sample sizes (small, medium and large) for each of the training
sets. The selection of sizes was based on the dimensionality of the datasets. We
considered $p < m$ for the small size, $p > m$ for the medium size, and $p \gg m$
for the large size, where $m$ is the data dimension.
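
As a quick check of the stated range, the shared fraction can be computed directly. The sketch below assumes that an overlapping set of size S_p is appended to a disjoint partition of size p, so that each classifier sees p + S_p patterns; the function name is hypothetical.

```python
def shared_fraction(partition_size, overlap_size):
    """Fraction of one classifier's training set that is shared with the others."""
    return overlap_size / (partition_size + overlap_size)

# Overlap sizes of 0%, 50% and 100% of a hypothetical partition of 100 patterns.
fractions = [shared_fraction(100, s) for s in (0, 50, 100)]
```

Under this assumption an overlap equal to the partition size gives a shared fraction of one half, which matches the 0-50% range quoted in the text.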

We used Linear and Quadratic classifiers as base classifiers in our experiments.
The use of stable classifiers enabled us to obtain consistent results by
eliminating drastic fluctuations in system performance caused by the classifiers
themselves. The number of classifiers was varied from 6 to 9. Given the size of
the datasets, creating more than 10 partitions resulted in insufficient training
data and poor performance for the classifiers.

Prior to partitioning the training data, the datasets were divided into training
and testing subsets. Subsequently, the training set was partitioned into smaller
subsets (Algorithm 1). The partitioning and combination processes were
repeated 20 to 30 times with different random seeds. Repartitioning the training
data was necessary to obtain consistent results and to eliminate the influence of
random selection. We calculated the mean and standard deviation of MCS
performance to estimate the amount of variation across runs.



5 Results and Discussion 

We examined a set of plots and compared the performance of the different
partitioning methods. We observed a common pattern of change among three
partitioning methods. Figure 2 illustrates the error rate for the 80-D Gaussian
dataset using the majority vote aggregation method (ensemble architecture).
Adding larger overlaps in the FOP method always reduced the generalization
capability of the system, unlike the DP and DPR methods, which resulted in
either improvement or no change. The difference between FOP and the other
techniques is that the union of the partitions generated by FOP is not the same
as the original training data. This means that as larger overlaps are added to
the partitions, many of the patterns are eliminated and not used in the training
of the MCS. It made little or no difference whether the partitions were created
by sampling with or without replications. Results for DP and OP are highlighted
for the rest of the paper.




Fig. 2. Performance of Different Partitioning Techniques 



Figures 3-14 compare the MCS error for different architectures and training
sample sizes: small, medium and large. In addition, for each of the training sizes,
these figures illustrate the changes in MCS error with respect to the size of the
overlap among partitions. The straight line in these figures represents the
performance of bagging on the original training set. The overall conclusion from
the results was that sharing training data was useful, especially for small and
medium training sizes. In the case of the 20-class dataset (Figures 3 and 4),
combining classifiers trained on disjoint partitions and small overlaps of the
large training set outperformed bagging. However, sharing patterns for the small
training set was not as effective as for the medium size. This is likely because
the partitions for the small training size were too small for the classifiers to
learn well. Another observation was that sharing training data was more essential
for a high-dimensional (complex) dataset such as 80-D Gaussian. The need for
larger training sets can be seen in Figures 9-14, in which bagging trained on the
original training set outperformed all MCS that were trained on smaller sets.

Fig. 3. 20 Class: Majority Vote

Fig. 4. 20 Class: Weighted Averaging

Fig. 5. 20 Class: Decoupled

Fig. 6. 20 Class: Top-down Competitive

Increasing the amount of sharing (overlap) improved the performance of all
architectures. This observation was examined for traditional combination rules
(Figures 3, 4, 9, 10). For the modular architecture, classifiers constructed on
TOP disjoint partitions were biased toward their own classes (Figures 5, 6, 11,
12). The improvement in performance was likely because presenting patterns
from other classes might have helped the classifiers to gain a common view of the
problem and, therefore, improve MCS performance. In the case of the Stacked
Generalization and HME architectures (Figures 7, 8, 13, 14), the performance of
the upper-level classifiers was highly dependent on the performance of the
classifiers at the lower levels. Classifiers trained on small datasets might have
large variances, since their parameters were poorly estimated. Thus, the inputs
to the next level were not accurate, which resulted in incorrect estimation of
the final decision.

Another observation was that, even though adding larger overlaps to disjoint
subsets reduced diversity among the classifiers, it resulted in better performance
most of the time. Selecting disjoint subsets of patterns increased the chance of
obtaining a group of independent classifiers. However, it might reduce the
performance of the individual classifiers if too few patterns were used.
Consequently, the performance of the combined, less accurate classifiers would be
low. These observations were consistent with the findings of Kuncheva et al. [8],
who concluded that there was no strong relationship between diversity and
accuracy. As a result, the best construction strategy for the MCS is likely to be
achieved if the classifiers are both diverse and able to generalize well [14].

Fig. 7. 20 Class: Stacked Generalization

Fig. 8. 20 Class: HME

Fig. 9. 80-D Gaussian: Majority Vote

Fig. 10. 80-D Gaussian: Weighted Averaging



6 Conclusion 

The performance of MCS is affected by many factors such as: the choice of base 
classifier, the training sample size, the way training data is partitioned, and the 
choice of architecture. Therefore, it is difficult to establish universal criteria to 
generalize the usefulness or drawbacks of sharing. In this paper, we empirically 
investigated correlation between sharing training data and MCS performance. 
We developed several partitioning techniques. The effect of sharing was exam- 
ined considering these techniques with two datasets, several architectures, and 
training data sizes. 

The overall observation from the results is that sharing is generally beneficial.
The improvement with larger overlaps, for all the architectures, may be explained
by the fact that training classifiers on disjoint partitions results in a set of
classifiers biased towards their own training data partitions. Combining these
biased classifiers may degrade the performance of the MCS. Sharing portions of
the training data is an alternative attempt to improve the performance. Shared
information replicated across the subsets of training data attempts to provide
each individual classifier with a more accurate view of the problem in hand.

Fig. 11. 80-D Gaussian: Decoupled

Fig. 12. 80-D Gaussian: Top-down Competitive

Fig. 13. 80-D Gaussian: Stacked Generalization

Fig. 14. 80-D Gaussian: HME

This study suggests that if there is too little data or the problem in hand is
complex, the gains achieved by replicating the patterns cannot compensate for the
decrease in accuracy of the individual classifiers. As a result, it is more
advisable to use the whole data for training and try to obtain diversity through
other methods. On the other hand, if the data is large enough, there is a sweet
spot for the training data size for which diversity through data partitioning is
most effective. The presented results even showed that, for less complex data
with a large enough training set, partitioning the data into disjoint sets may
improve the accuracy of multiple classifiers.
Abstract. Building an ensemble of classifiers is a useful way to improve
performance with respect to a single classifier. In the case of neural networks,
the literature has centered on the use of Multilayer Feedforward. However,
there are other interesting networks, like Radial Basis Functions (RBF), that can
be used as elements of the ensemble. Furthermore, as pointed out recently, the
RBF network can also be trained by gradient descent, so all the methods of
constructing the ensemble designed for Multilayer Feedforward are also applicable
to RBF. In this paper we present the results of using eleven methods to
construct an ensemble of RBF networks. We have trained ensembles of a reduced
number of networks (3 and 9) to keep the computational cost low. The results
show that the best method is in general the Simple Ensemble.



1 Introduction 

Probably the most important property of a neural network (NN) is the generalization 
capability, i.e., the ability to correctly respond to inputs which were not used in the 
training set. 

One method to increase the generalization capability with respect to a single NN
consists of training an ensemble of NNs, i.e., training a set of NNs with different
weight initializations or properties and combining the outputs of the different
networks in a suitable manner to give a single output.

In the field of ensemble design, the two key factors are how
to train the individual networks to get uncorrelated errors and how to combine the
different outputs of the networks to give a single output.

It seems clear from the literature that this procedure generally increases the
generalization capability in the case of the Multilayer Feedforward NN [1,2].

However, in the field of NNs there are other interesting networks besides
Multilayer Feedforward, and traditionally the use of ensembles of NNs has been
restricted to the use of Multilayer Feedforward as the element of the ensemble.

Another interesting network that is widely used in applications is Radial Basis
Functions (RBF). This network can also be trained in a fully supervised way by
gradient descent, as shown recently [3-4]. Furthermore, the performance of this
way of training is even superior to the traditional unsupervised clustering for
the centers and supervised training for the weights of the outputs.

So, with fully supervised gradient descent training, this network can also be an
element of an ensemble, and all methods of constructing the ensemble which are
applicable to Multilayer Feedforward can now also be used with RBF networks.

In this paper we try different methods of constructing the RBF networks in the
ensemble to obtain the first results on ensembles of RBF networks.

Among the methods of combining the outputs, the two most popular are voting and
output averaging [5]. In this paper we will normally use output averaging. We have
also performed experiments with voting, and the results are very similar in the
case of RBF networks.

2 Theory 

In this section, we first briefly review the basic concepts of RBF networks and
gradient descent training, and after that we review the different methods of
constructing the ensemble which are applied to the RBF networks. Full descriptions
of both items can be found in the references.



2.1 RBF Networks with Gradient Descent Training 



An RBF network has two layers (without counting the input units). The first layer
is composed of neurons with a Gaussian transfer function and the second layer (the
output units) has neurons with a linear transfer function. The output of an RBF
network can be calculated with equation 1.

$$F_k(x) = \sum_{q=1}^{Q} w_q^k \exp\left(-\frac{\sum_{n} (c_{q,n} - x_n)^2}{\sigma^2}\right) \qquad (1)$$

Where $c_{q,n}$ are the components of the centers of the $Q$ Gaussian functions, $\sigma$
controls the width of the Gaussian functions and $w_q^k$ are the weights between the
Gaussian units and the output units.

In the case of RBF neural networks trained with gradient descent [3], a constant $\sigma$
is used for all Gaussian units of the network, and this parameter is determined by trial
and error, fixing a value before training, i.e., the width is not adaptively changed during
training.

The parameters which are changed during the training process are $c_{q,n}$, the centers
of the Gaussian functions, and $w_q^k$, the weights between the Gaussian units and the
output units. In the original reference the training is performed off-line (i.e., in batch
mode), but we have concluded that the online version presented here has several
advantages, like a lower number of iterations to complete training. The equation for the
adaptation of the weights is the following:

$$\Delta w_q^k = \eta \, \varepsilon_k \exp\left(-\frac{\sum_{n} (c_{q,n} - x_n)^2}{\sigma^2}\right) \qquad (2)$$

Where $\eta$ is the step size and $\varepsilon_k$ is the difference between the target and the output
of the network for the particular training pattern.

Equation 3 gives the adaptation of the centers, and equation 4 defines the error term it uses.

$$\Delta c_q = \eta \, \varepsilon_q^h \, (x - c_q) \qquad (3)$$

$$\varepsilon_q^h = \alpha_q \sum_{k=1}^{n_o} \varepsilon_k \, w_q^k \qquad (4)$$

Where $x$ is the current training pattern, $n_o$ is the number of output units and $\alpha_q$ is
given by the following equation.

$$\alpha_q = \frac{2}{\sigma^2} \exp\left(-\frac{\sum_{n} (c_{q,n} - x_n)^2}{\sigma^2}\right) \qquad (5)$$


2.2 Ensemble Design Methods 

Simple Ensemble: A simple ensemble can be constructed by training different
networks with the same training set, but with different random initializations of
the weights (centers and weights in the case of RBF networks). In this ensemble
technique, we expect that the networks will converge to different local minima and
that the errors will be uncorrelated.

Bagging: This ensemble method is described in reference [5]. It consists of
generating different datasets drawn at random with replacement from the
original training set. After that, we train the different networks in the ensemble with
these different datasets (one network per dataset). We have used datasets which have
a number of training points equal to twice the number of points of the original training
set, as recommended in reference [1].

Bagging with Noise (BagNoise): It was proposed in [2]. It is a modification of
Bagging. In this case we use datasets of size 10·N generated in the same way as in
Bagging, where N is the number of training points of the initial training set. We
also add to every selected training point a random noise drawn from
a normal distribution with a small variance.
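
The resampling behind Bagging and BagNoise can be sketched as below. The representation of a pattern as an (input vector, label) pair, the function names and the example noise level are assumptions made for illustration only.

```python
import random

def bootstrap(data, size, rng):
    """Draw `size` patterns at random with replacement from data."""
    return [rng.choice(data) for _ in range(size)]

def bag_noise(data, rng, factor=10, noise_std=0.1):
    """BagNoise: bootstrap factor*N patterns and add zero-mean Gaussian noise to each input."""
    sample = bootstrap(data, factor * len(data), rng)
    return [([v + rng.gauss(0.0, noise_std) for v in x], label)
            for x, label in sample]

rng = random.Random(1)
data = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]   # toy patterns
bagging_set = bootstrap(data, 2 * len(data), rng)            # twice N, as in the text
noisy_set = bag_noise(data, rng)                             # 10*N noisy patterns
```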

Boosting: This ensemble method is reviewed in [5]. It is conceived for an ensemble of
only three networks. It trains the three networks of the ensemble with different training
sets. The first network is trained with the whole training set, N input patterns. After
this training, we pass all N patterns through the first network and we use a subset of
them, such that the new training set has 50% of patterns incorrectly classified by the
first network and 50% correctly classified. With this new training set we train the
second network. After the second network is trained, the N original patterns are
presented to both networks. If the two networks disagree in the classification, we add the
training pattern to the third training set. Otherwise we discard the pattern. With this
third training set we train the third network.

In the original theoretical derivation of the algorithm, the evaluation of the test
performance was as follows: present a test pattern to the three networks; if the first two
networks agree, use this label, otherwise use the label assigned by the third network.
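
The construction of the second and third training sets and the test-time rule can be sketched as follows. Networks are modelled as plain callables returning a class label, and the first patterns found are used instead of a random selection; both are simplifying assumptions, and the function names are hypothetical.

```python
def second_training_set(patterns, labels, net1):
    """Indices for network 2: half misclassified by network 1, half correctly classified."""
    wrong = [i for i, (x, y) in enumerate(zip(patterns, labels)) if net1(x) != y]
    wrong_set = set(wrong)
    right = [i for i in range(len(patterns)) if i not in wrong_set]
    n = min(len(wrong), len(right))
    return wrong[:n] + right[:n]

def third_training_set(patterns, net1, net2):
    """Indices for network 3: patterns on which networks 1 and 2 disagree."""
    return [i for i, x in enumerate(patterns) if net1(x) != net2(x)]

def boosting_predict(x, net1, net2, net3):
    """Use the common label if networks 1 and 2 agree, else the label of network 3."""
    first, second = net1(x), net2(x)
    return first if first == second else net3(x)

# Toy one-dimensional example with threshold "networks".
patterns = [0, 1, 2, 3, 4, 5]
labels = [0, 0, 0, 1, 1, 1]
net1 = lambda x: 1 if x >= 2 else 0
net2 = lambda x: 1 if x >= 3 else 0
net3 = lambda x: 0
set2 = second_training_set(patterns, labels, net1)
set3 = third_training_set(patterns, net1, net2)
```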

CVC: It is reviewed in [1]. In k-fold cross-validation, the training set is divided into k 
subsets. Then, k-1 subsets are used to train the network and results are tested on the 
subset that was left out. Similarly, by changing the subset that is left out of the train- 
ing process, one can construct k classifiers, each of which is trained on a slightly 
different training set. 
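
A minimal sketch of how the k training sets of CVC can be built is shown below. The interleaved assignment of patterns to folds and the function name are assumptions; any split into k subsets would serve the same purpose.

```python
def cvc_training_sets(data, k):
    """Return k training sets, each leaving out a different one of the k folds."""
    folds = [data[i::k] for i in range(k)]
    return [[x for j, fold in enumerate(folds) if j != left_out for x in fold]
            for left_out in range(k)]

training_sets = cvc_training_sets(list(range(10)), 5)
```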



Adaboost: We have implemented the algorithm called “Adaboost.M1” in
reference [6]. In the algorithm the successive networks are trained with a training set
selected at random from the original training set, but the probability of selecting a
pattern changes depending on whether the pattern was correctly classified and on the
performance of the last trained network. The algorithm is complex and the full
description can be found in the reference. The method of combining the outputs of
the networks is also particular to this algorithm.



Decorrelated (Deco): This ensemble method was proposed in [7]. It consists of
introducing a penalty term added to the usual Backpropagation error function. The
penalty term for network number j in the ensemble is in equation 6.

$$Penalty = \lambda \cdot d(i,j) \, (y - f_i)(y - f_j) \qquad (6)$$

Where $\lambda$ determines the strength of the penalty term and should be found by trial
and error, $y$ is the target of the training pattern and $f_i$ and $f_j$ are the outputs of
networks number i and j in the ensemble. The term d(i,j) is in equation 7.

$$d(i,j) = \begin{cases} 1, & \text{if } i = j - 1 \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$



Decorrelated2 (Deco2): It was also proposed in reference [7]. It is basically the same
method as “Decorrelated” but with a different term d(i,j) in the penalty. In this case
the expression of d(i,j) is in equation 8.

$$d(i,j) = \begin{cases} 1, & \text{if } i = j - 1 \text{ and } i \text{ is even} \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$
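
The two indicator terms and the resulting penalty can be illustrated with a short Python sketch. Networks are numbered from 1, as in the text, and the penalty for network j is summed over all networks i; that summation and the function names are assumptions made for illustration (with these indicators at most one term is non-zero).

```python
def d_deco(i, j):
    """Indicator of Deco: 1 only when network i immediately precedes network j."""
    return 1 if i == j - 1 else 0

def d_deco2(i, j):
    """Indicator of Deco2: as Deco, but only when i is even."""
    return 1 if i == j - 1 and i % 2 == 0 else 0

def deco_penalty(j, outputs, target, lam, d=d_deco):
    """Penalty added to the error of network j; outputs[i - 1] is the output of network i."""
    f_j = outputs[j - 1]
    return sum(lam * d(i, j) * (target - outputs[i - 1]) * (target - f_j)
               for i in range(1, len(outputs) + 1))

outputs = [0.8, 0.6, 0.9]        # hypothetical outputs of three networks
p_deco = deco_penalty(2, outputs, target=1.0, lam=0.5)
p_deco2_net2 = deco_penalty(2, outputs, target=1.0, lam=0.5, d=d_deco2)
p_deco2_net3 = deco_penalty(3, outputs, target=1.0, lam=0.5, d=d_deco2)
```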



Evol: This ensemble method was proposed in [8]. In each iteration (presentation of a
training pattern), the output of the ensemble for the input pattern is calculated by
voting. If the pattern is correctly classified, we continue with the next iteration and
pattern. Otherwise, the network with an erroneous output and the lowest MSE (Mean
Squared Error) is trained on this pattern until the output of the network is correct. This
procedure is repeated for several networks until the vote of the ensemble classifies
the pattern correctly. For a full description of the method see the reference.
Cels: It was proposed in [9]. This method also uses a penalty term added to the usual
Backpropagation error function to decorrelate the outputs of the networks in the
ensemble. In this case the penalty term for network number i is in equation 9.

$$Penalty = \lambda \cdot (f_i - y) \sum_{j \neq i} (f_j - y) \qquad (9)$$

Where $y$ is the target of the input pattern and $f_i$ and $f_j$ are the outputs of networks
number i and j for this pattern.

The authors propose in the paper the winner-take-all procedure to combine the
outputs of the individual networks, i.e., the highest output is the output of the ensemble.
We have used this procedure and the usual output averaging.
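
The Cels penalty and the winner-take-all combination can be sketched as follows. The sketch assumes that the penalty multiplies the deviation of network i by the summed deviations of the other networks, scaled by the parameter lambda mentioned in the experimental section; indices start at 0 here and the function names are hypothetical.

```python
def cels_penalty(i, outputs, target, lam):
    """Penalty for network i: its deviation times the summed deviations of the others."""
    others = sum(f - target for j, f in enumerate(outputs) if j != i)
    return lam * (outputs[i] - target) * others

def winner_take_all(network_outputs):
    """network_outputs[n][c] is the output of network n for class c.

    The class of the single highest output across all networks is returned.
    """
    best_value, best_class = max(
        (value, c) for outs in network_outputs for c, value in enumerate(outs))
    return best_class

penalty = cels_penalty(0, [0.8, 0.6, 0.9], target=1.0, lam=0.5)
decision = winner_take_all([[0.2, 0.7], [0.9, 0.1]])
```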

Ola: This ensemble method was proposed in [10]. First, several datasets are generated
by using bagging, with a number of training patterns in each dataset equal to the
original number of patterns of the training set. Every network is trained on one of these
datasets and on virtual data. The virtual data for network i is generated by randomly
selecting samples from the original training set and perturbing each sample with
random noise drawn from a normal distribution with small variance. The target for this new
virtual sample is calculated from the output of the ensemble without network number i
for this sample. For a full description of the procedure see the reference.



3 Experimental Results 

We have applied the eleven ensemble methods to nine different classification
problems. They are from the UCI repository of machine learning databases. Their names
are Balance Scale (BALANCE), Cylinder Bands (BANDS), Liver Disorders
(BUPA), Credit Approval (CREDIT), Glass Identification (GLASS), Heart Disease
(HEART), the Monk’s Problems (MONK’1, MONK’2) and Voting Records (VOTE).
The complete data and a full description can be found in the UCI repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html).

The general characteristics of these databases are summarized in Table 1. The second
column, with header “Ninput”, is the number of inputs of the database. The fifth column,
“Noutput”, is the number of outputs (classes) of the database. The sixth column, “Ntrain”, is
the number of patterns included in the training set. The seventh column, “Ncross”, is
the number of patterns in the cross-validation set and, finally, the eighth column,
“Ntest”, is the number of patterns used to test the performance. Furthermore, this table
includes several characteristics of the RBF networks trained in the ensemble: the third
column, “Nclusters”, is the number of Gaussian units in the RBF network and
the fourth column contains the learning step of the gradient descent algorithm.
These parameters were selected by trial and error and cross-validation to find the most
appropriate values.

The first step in constructing the ensembles was to determine the right parameters for
each database in the case of the methods Cels (parameter lambda of the penalty), Ola
(standard deviation of the noise), Deco and Deco2 (parameter lambda of the penalty)
and BagNoise (standard deviation of the noise). The values of the final parameters
obtained by trial and error and cross-validation are in Table 2.
Table 1. General characteristics of the Databases and Networks.

Database  Ninput  Nclusters  Step η  Noutput  Ntrain  Ncross  Ntest
Balance   4       60         0.005   3        395     105     125
Band      39      40         0.01    2        177     45      55
Bupa      6       40         0.01    2        220     55      70
Credit    15      30         0.005   2        418     105     130
Glass     10      110        0.01    6        129     35      50
Heart     13      20         0.005   2        190     48      59
Monk1     6       30         0.005   2        282     70      80
Monk2     6       45         0.005   2        282     70      80
Vote      16      5          0.01    2        285     70      80



Table 2. Parameters of different ensemble methods.

Column headers: Cels, Ola, Deco, Deco2, BagNoise; sub-headers: 3, 9, 3, 9.

Balance  0.75  0.25  0.4  0.6  0.6  0.6  0.1
Band     0.1   0.5   0.3  0.3  0.6  0.6  0.1
Bupa     0.1   0.1   0.6  0.6  0.6  0.6  0.1
Credit   0.25  0.25  0.6  0.5  0.6  0.6  0.4
Glass    0.25  0.25  0.3  0.3  0.6  0.6  0.1
Heart    0.1   0.1   0.4  0.3  0.6  0.6  0.2
Monk1    0.1   0.1   0.2  0.6  0.6  0.6  0.1
Monk2    0.1   0.25  0.6  0.6  0.8  0.6  0.1
Vote     0.1   0.1   0.6  0.6  0.6  0.6  0.1



With these parameters we trained the ensembles of 3 and 9 networks. We selected a
low number of networks to keep the computational cost low. We repeated this process
of training an ensemble ten times for ten different partitions of the data into training,
cross-validation and test sets. In this way, we can obtain a mean performance of the
ensemble for each database (the mean of the ten ensembles) and an error in the
performance calculated by standard error theory. The results are in Table 3 for
the case of ensembles of three networks and in Table 4 for the case of nine. We have
also included in Table 3 the mean performance of a single network for comparison.
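
The mean and its error over the ten repetitions can be computed as in the sketch below. It assumes that "standard error theory" refers to the standard error of the mean based on the sample standard deviation; the ten accuracy values are hypothetical and are not results from the paper.

```python
import math
import statistics

def mean_and_standard_error(accuracies):
    """Return the mean and the standard error of the mean of a list of accuracies."""
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, sem

# Hypothetical test accuracies (%) of ten ensembles trained on ten data partitions.
runs = [90, 92, 88, 91, 89, 90, 93, 87, 90, 90]
mean, sem = mean_and_standard_error(runs)
```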

As commented before, we have performed experiments with voting and output
averaging as combination methods. The results are similar, and we have reproduced here
the results for output averaging, except for the case of Cels, where the
winner-take-all strategy is clearly better, and for the case of Evol, where voting is
better than output averaging.

By comparing the results of Tables 3 and 4 with the results of a single network we
can see that the improvement of the ensemble is database and method dependent.
Sometimes the general performance of an ensemble (as in the case of Balance) is worse
than that of the single network; the reason may be the combination method (output
averaging), which does not exploit the performance of the individual networks. Besides that,
there is one method which clearly performs worse than the single network, which is
Evol, but we obtained the same result for ensembles of Multilayer Feedforward
networks.

To see the results more clearly, we have also calculated the percentage of error re- 
duction of the ensemble with respect to a single network. We have used equation 10 
for this calculation. 
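
The calculation can be illustrated with a small Python sketch. It assumes that the percentage of error reduction is the relative decrease of the error percentage with respect to the single network; the function name is hypothetical, and the example uses the HEART accuracies of the single network (82.0) and the Simple Ensemble of three networks (84.6) from Table 3.

```python
def error_reduction(single_accuracy, ensemble_accuracy):
    """Percentage of error reduction of an ensemble with respect to a single network."""
    err_single = 100.0 - single_accuracy
    err_ensemble = 100.0 - ensemble_accuracy
    return 100.0 * (err_single - err_ensemble) / err_single

heart_reduction = error_reduction(82.0, 84.6)   # roughly 14.4
```

A negative value indicates that the ensemble performs worse than the single network.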
Table 3. Results for the ensemble of three networks.

               BALANCE     BAND        BUPA        CREDIT      GLASS
Single Net.    90.2 ± 0.5  74.0 ± 1.1  70.1 ± 1.1  86.0 ± 0.8  93.0 ± 0.6
Adaboost       88.0 ± 1.2  73 ± 2      68.4 ± 0.5  84.7 ± 0.6  91.3 ± 1.3
Bagging        89.7 ± 0.8  73 ± 2      70.1 ± 1.6  87.4 ± 0.5  93.8 ± 1.2
Bag Noise      89.8 ± 0.8  73.1 ± 1.3  64 ± 2      87.1 ± 0.7  92.2 ± 0.9
Boosting       88.2 ± 0.8  70.7 ± 1.8  70.6 ± 1.6  86.6 ± 0.7  92.2 ± 0.9
Cels           89.5 ± 0.8  75.3 ± 1.4  69.3 ± 1.4  86.9 ± 0.5  93.0 ± 1.0
CVC            90 ± 0.7    75.1 ± 1.4  69.9 ± 1.5  87.5 ± 0.5  92.4 ± 1.1
Decorrelated   89.8 ± 0.8  73.3 ± 1.7  71.9 ± 1.6  87.1 ± 0.5  93.2 ± 1.0
Decorrelated2  89.8 ± 0.8  73.8 ± 1.4  71.3 ± 1.4  87.2 ± 0.5  93.4 ± 1.0
Evol           89.5 ± 0.9  67.6 ± 1.3  63 ± 2      84.6 ± 1.1  88 ± 2
Ola            88.2 ± 0.9  73.1 ± 1.3  68.1 ± 1.5  86.1 ± 0.7  89.4 ± 1.8
Simple Ense    89.7 ± 0.7  73.8 ± 1.2  71.9 ± 1.1  87.1 ± 0.5  93.2 ± 1.0



Table 3 (continuation). Results for the ensemble of three networks.

               HEART       MONK1       MONK2       VOTE
Single Net.    82.0 ± 1.0  98.5 ± 0.5  91.3 ± 0.7  95.4 ± 0.5
Adaboost       82.0 ± 1.4  88.0 ± 1.7  84.9 ± 1.5  95.1 ± 0.8
Bagging        83.6 ± 1.8  99.1 ± 0.6  88.0 ± 1.4  95.8 ± 0.6
Bag Noise      83.2 ± 1.5  94.3 ± 1.3  89.5 ± 1.1  95.6 ± 0.7
Boosting       82.4 ± 1.2  94.9 ± 1.4  88.4 ± 1.0  95.9 ± 0.6
Cels           82.7 ± 1.7  94.3 ± 1.3  89.5 ± 1.3  94.9 ± 0.6
CVC            83.9 ± 1.6  95.5 ± 1.3  84.9 ± 1.7  95.9 ± 0.7
Decorrelated   84.1 ± 1.4  99.5 ± 0.4  89.8 ± 1.3  96.0 ± 0.7
Decorrelated2  83.3 ± 1.4  99.4 ± 0.4  91.9 ± 1.2  96.1 ± 0.6
Evol           79 ± 2      74.8 ± 1.4  64.6 ± 1.6  92.3 ± 0.7
Ola            81.5 ± 1.2  75.1 ± 1.1  83.1 ± 1.1  96.1 ± 0.4
Simple Ense    84.6 ± 1.5  99.6 ± 0.4  90.9 ± 1.1  96.4 ± 0.6



Table 4. Results for the ensemble of nine networks.

               BALANCE     BAND        BUPA        CREDIT      GLASS
Single Net.    90.2 ± 0.5  74.0 ± 1.1  70.1 ± 1.1  86.0 ± 0.8  93.0 ± 0.6
Adaboost       91.8 ± 0.8  71.5 ± 1.2  69.6 ± 1.1  84.8 ± 0.8  93.1 ± 1.3
Bagging        90 ± 0.8    74.3 ± 1.4  71.0 ± 1.5  87.5 ± 0.5  93.6 ± 1.2
Bag Noise      90 ± 0.8    73.1 ± 1.1  64.1 ± 1.9  87.1 ± 0.5  91.4 ± 0.9
Cels           89.7 ± 0.8  74.0 ± 1.2  69.4 ± 1.9  87.1 ± 0.5  92.4 ± 1.1
CVC            89.8 ± 0.8  73.6 ± 1.3  70 ± 2      87.6 ± 0.5  93.0 ± 1.1
Decorrelated   89.8 ± 0.8  73.5 ± 1.8  71.4 ± 1.4  87.2 ± 0.6  93.0 ± 1.0
Decorrelated2  89.8 ± 0.8  73.8 ± 1.8  71.6 ± 1.2  87.2 ± 0.5  93.2 ± 1.0
Evol           88.1 ± 1.1  67.6 ± 1.3  63 ± 2      83.4 ± 1.3  82.6 ± 1.8
Ola            88.5 ± 0.7  74.5 ± 1.2  69.6 ± 1.4  81.8 ± 1.1  90.6 ± 1.2
Simple Ense    89.7 ± 0.7  73.3 ± 1.4  72.4 ± 1.2  87.2 ± 0.5  93.0 ± 1.0



Table 4 (continuation). Results for the ensemble of nine networks.

               HEART        MOK1         MOK2         VOTE
Single Net.    82.0 ± 1.0   98.5 ± 0.5   91.3 ± 0.7   95.4 ± 0.5
Adaboost       81.0 ± 1.5   91.1 ± 1.4   87.0 ± 1.2   95.9 ± 0.6
Bagging        84.9 ± 1.2   99.4 ± 0.4   89.3 ± 1.2   95.9 ± 0.6
Bag Noise      83.6 ± 1.6   95.5 ± 1.4   90.1 ± 1.2   96.0 ± 0.7
Cels           82.5 ± 1.4   93.3 ± 0.7   87.5 ± 1.2   94.9 ± 0.6
CVC            84.1 ± 1.3   99.4 ± 0.5   90.5 ± 1.0   96.1 ± 0.7
Decorrelated   84.1 ± 1.3   99.4 ± 0.4   90.3 ± 1.1   96.1 ± 0.7
Decorrelated2  84.7 ± 1.4   99.6 ± 0.4   91.1 ± 1.2   96.1 ± 0.6
Evol           78 ± 2       74.8 ± 1.4   64.6 ± 1.6   92.3 ± 0.7
Ola            78 ± 2       71.1 ± 1.6   82.0 ± 1.4   96.1 ± 0.4
Simple Ense    83.9 ± 1.5   99.6 ± 0.4   91.4 ± 1.2   96.3 ± 0.6
$$PorError_{reduction} = 100 \cdot \frac{PorError_{single\ network} - PorError_{ensemble}}{PorError_{single\ network}} \qquad (10)$$

In this last equation, $PorError_{single\ network}$ is the error percentage of a single
network (for example, 100-90.2=9.8% in the case of database Balance, see Table 3)
and $PorError_{ensemble}$ is the error percentage in the ensemble with a particular
method (for example, 100-88.0=12.0% in the case of Adaboost and database
Balance, see Table 3).

The value of the percentage of error reduction ranges from 0%, where there is no 
improvement by the use of a particular ensemble method with respect to a single net- 
work, to 100% where the error of the ensemble is 0%. There can also be negative 
values, which means that the performance of the ensemble is worse than the perform- 
ance of the single network. 
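
As a quick check of this definition, the following Python sketch recomputes the percentage of error reduction from two accuracy percentages. The function name `error_reduction` is ours and is used only for illustration; the input values are taken from Table 4 (database Balance, single network and Adaboost with nine networks).

```python
def error_reduction(acc_single, acc_ensemble):
    """Percentage of error reduction of an ensemble with respect to a single network.

    Both arguments are accuracy percentages (e.g. 90.2), so the error
    percentages are 100 - accuracy.
    """
    err_single = 100.0 - acc_single
    err_ensemble = 100.0 - acc_ensemble
    if err_single == 0:
        raise ValueError("the single network has zero error; the measure is undefined")
    return 100.0 * (err_single - err_ensemble) / err_single


# Balance database: single network 90.2, Adaboost with nine networks 91.8 (Table 4).
print(round(error_reduction(90.2, 91.8), 2))
```

The value obtained for this pair agrees with the Adaboost entry for Balance in Table 5. An ensemble that is less accurate than the single network gives a negative value.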

This new measurement is relative and allows a clearer comparison of the different
methods. Table 5 gives the results for each database and method in the case of
the ensemble of nine networks.

Furthermore, we have calculated the mean percentage of error reduction across all
databases; it is given in the last column, with header "MEAN".

According to this mean measurement, five methods perform worse than the Single
Network: Adaboost, BagNoise, Cels, Evol and Ola. The behaviour of Evol is clear:
it is worse than the single network in all databases. Ola, Adaboost, BagNoise and
Cels are in general problem dependent (unstable); for example, Adaboost gets an
improvement in databases Balance and Vote, but in the rest its performance is
poor.



Table 5. Percentage of error reduction for the ensemble of 9 networks.

               BALANCE   BAND      BUPA      CREDIT    GLAS
Adaboost       16.33     -9.62     -1.67     -8.57     1.43
Bagging        -2.04     1.15      3.01      10.71     8.57
Bag Noise      -2.04     -3.46     -20.07    7.86      -22.86
Cels           -5.10     0         -2.34     7.86      -8.57
CVC            -4.08     -1.54     -0.33     11.43     0
Decorrelated   -4.08     -1.92     4.35      8.57      0
Decorrelated2  -4.08     -0.77     5.02      8.57      2.86
Evol           -21.42    -24.62    -23.75    -18.57    -148.57
Ola            -17.35    1.92      -1.67     -30       -34.29
Simple Ense    -5.10     -2.69     7.69      8.57      0



Table 5 (continuation). Percentage of error reduction for the ensemble of 9 networks.

               HEART     MOK1       MOK2      VOTE      MEAN
Adaboost       -5.56     -493.33    -49.43    10.87     -59.95
Bagging        16.11     60         -22.99    10.87     9.49
Bag Noise      8.89      -200       -13.79    13.04     -25.82
Cels           2.78      -346.67    -43.68    -10.87    -45.18
CVC            11.67     60         -9.20     15.22     9.24
Decorrelated   11.67     60         -11.49    15.22     9.14
Decorrelated2  15        73.33      -2.30     15.22     12.53
Evol           -22.22    -1580      -306.90   -67.39    -245.94
Ola            -22.22    -1826.67   -106.90   15.22     -224.66
Simple Ense    10.56     73.33      1.15      19.57     12.56
The best and most regular method across all databases seems to be the Simple
Ensemble.

Finally, to see the influence of the number of networks in the ensemble, we
present the results of the mean percentage of error reduction for ensembles of
three and nine networks in Table 6. We can see from the results that in general
there is an improvement in performance from the ensemble of three to the ensemble
of nine networks. Perhaps the only notable exception is the Simple Ensemble,
whose performance is roughly the same.



Table 6. Mean percentage of error reduction for ensembles of 3 and 9 networks.

               Ensemble of 3 networks   Ensemble of 9 networks
Adaboost       -93.96                   -59.95
Bagging        3.57                     9.49
Bag Noise      -35.69                   -25.82
Boosting       -33.20                   -
Cels           -34.01                   -45.18
CVC            -27.60                   9.24
Decorrelated   9.34                     9.14
Decorrelated2  11.14                    12.53
Evol           -234.21                  -245.94
Ola            -191.45                  -224.66
Simple Ense    12.96                    12.56



4 Conclusions 

In this paper we have presented results of eleven different methods to construct
an ensemble of RBF networks, using nine different databases. We trained ensembles
of a reduced number of networks, in particular three and nine networks, to keep
the computational cost low. The results showed that in general the performance is
method and problem dependent; sometimes the performance of the ensemble is even
worse than that of the single network. The reason may be that the combination
method (output averaging) is not appropriate in the case of RBF networks. The
best and most regular performing method across all databases seems to be the
Simple Ensemble, i.e., the rest of the methods, proposed to increase the
performance of Multilayer Feedforward networks, seem not to be so useful for RBF
networks. Perhaps another reason for this result is the combination method
(output averaging), which may not be very appropriate, as commented before.
Future research will go in the direction of trying other combination methods with
ensembles of RBF networks.
Abstract. Bagging can be interpreted as an approximation of random
aggregating, an ideal ensemble method by which base learners are trained 
using data sets randomly drawn according to an unknown probability 
distribution. An approximate realization of random aggregating can be 
obtained through subsampled bagging, when large training sets are avail- 
able. In this paper we perform an experimental bias-variance analysis of 
bagged and random aggregated ensembles of Support Vector Machines, 
in order to quantitatively evaluate their theoretical variance reduction 
properties. Experimental results with small samples show that random 
aggregating, implemented through subsampled bagging, reduces the vari- 
ance component of the error by about 90%, while bagging, as expected, 
achieves a lower reduction. Bias-variance analysis explains also why en- 
semble methods based on subsampling techniques can be successfully 
applied to large data mining problems. 



1 Introduction 

Random aggregating is a process by which base learners, trained on samples
drawn according to an unknown probability distribution from the entire universe
population, are aggregated through majority voting (classification) or
averaging (regression).

Random aggregating is only a theoretical ensemble method. When large data
sets are available, subsampled bagging can simulate random aggregating, using a
uniform probability distribution and resampling techniques, assuming that the
large available data set and the uniform probability distribution are good
approximations of the universe population and of the unknown probability
distribution, respectively.

Breiman showed that in regression problems, random aggregation of predic- 
tors always improves the performance of single predictors, while in classification 
problems this is not always the case, if poor base predictors are used [1]. The 
improvement depends on the stability of the base learner: random aggregating 
and bagging are effective with unstable learning algorithms, that is when small 
changes in the training set can result in large changes in the predictions of the 
base learners [3]. 
Breiman also showed that bootstrap aggregating (bagging) techniques improve
the accuracy of a single predictor, mainly by reducing the variance component
of the error [1,2].

Recently, bias-variance decomposition of the error has been applied as a tool
to study the behaviour of learning algorithms and to develop new ensemble
methods well-suited to the bias-variance characteristics of the base learners [4].
In this paper we extend to bagged and random aggregated ensembles of Support
Vector Machines (SVMs) the bias-variance analysis performed on single
SVMs [5].

The main purpose of this contribution is to evaluate quantitatively the
variance reduction properties of both random aggregating and bagging, through
an extended experimental bias-variance decomposition of the error. In this way
we can quantitatively verify the theoretical results obtained by Breiman about
the variance reduction properties of random aggregating, and we can also
understand the extent to which Breiman's results obtained for random
aggregating can be extended to bagging.

The experimental results suggest that subsampled bagged ensembles of SVMs 
could be successfully applied to very large scale data mining problems, as they 
significantly reduce the variance component of the error with respect to a single 
Support Vector Machine. In order to verify this hypothesis, we performed some 
preliminary experiments on a large synthetic data set, comparing the accuracy 
and the computational time of single SVMs trained on the entire large training 
set with ensembles of SVMs trained on small subsamples of the available data. 

2 Random Aggregating, Subsampled
and Classical Bagging

There are close relationships between random aggregating, subsampled and
classical bagging: bagging can be interpreted as an approximation of random
aggregating. Subsampled bagging, in turn, can be interpreted as an approximate
implementation of random aggregating, where the universe population is replaced
by a large training set $\mathcal{D}$, and subsamples are randomly drawn from
$\mathcal{D}$ according to a uniform probability distribution.

2.1 Random Aggregating 

Let $D$ be a set of $m$ points drawn identically and independently from $U$
according to $P$, where $U$ is a population of labeled training data points
$(x_j, t_j)$, and $P(x, t)$ is the joint distribution of the data points in $U$,
with $x \in \mathbb{R}^d$.

Let $\mathcal{L}$ be a learning algorithm, and define $f_D = \mathcal{L}(D)$ as
the predictor produced by $\mathcal{L}$ applied to a training set $D$. The model
produces a prediction $f_D(x) = y$. Suppose that a sequence of learning sets
$\{D_k\}$ is given, each i.i.d. from the same underlying distribution $P$.
Breiman proposed to aggregate the $f_D$ trained with different samples drawn
from $U$ to get a better predictor $f_A(x, P)$ [1]. For regression problems
$t_j \in \mathbb{R}$ and $f_A(x, P) = E_D[f_D(x)]$, where $E_D[\cdot]$ indicates
the expected value with respect to the distribution of $D$, while in
classification problems $t_j \in S \subset \mathbb{N}$, and
$f_A(x, P) = \arg\max_j |\{k \mid f_{D_k}(x) = j\}|$.
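
The two aggregation rules can be sketched in a few lines of Python. This is only an illustration for a finite list of base predictions at a single point $x$ (the true random aggregating takes an expectation over all training sets); the function names and the tie-breaking rule are our own assumptions.

```python
from collections import Counter


def aggregate_regression(predictions):
    """Average the outputs f_{D_k}(x) of the base predictors at one point x."""
    return sum(predictions) / len(predictions)


def aggregate_classification(predictions):
    """Return the class predicted by the largest number of base predictors at x.

    Ties are broken in favour of the class that appears first in `predictions`
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    return max(counts, key=counts.get)


# Three regressors and four classifiers evaluated at the same point x.
print(aggregate_regression([1.0, 2.0, 3.0]))
print(aggregate_classification([2, 0, 2, 1]))
```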

As the training sets $D$ are randomly drawn from $U$, we name the procedure
to build $f_A$ random aggregating. In order to simplify the notation, we denote
$f_A(x, P)$ as $f_A(x)$. Considering regression problems, if $T$ and $X$ are
random variables having joint distribution $P$, the expected squared loss $EL$
for the single predictor $f_D(X)$ is:

$$EL = E_D[E_{T,X}[(T - f_D(X))^2]] \qquad (1)$$

while the expected squared loss $EL_A$ for the aggregated predictor is:

$$EL_A = E_{T,X}[(T - f_A(X))^2] \qquad (2)$$

Breiman showed [1] that $EL \geq EL_A$. This inequality depends on the
instability of the predictions, that is on how unequal the two sides of eq. 3 are:

$$E_D[f_D(X)]^2 \leq E_D[f_D^2(X)] \qquad (3)$$

There is a strict relationship between the instability and the variance of the
base predictor. Indeed the variance $V(X)$ of the base predictor is:

$$V(X) = E_D[(f_D(X) - E_D[f_D(X)])^2] = E_D[f_D^2(X)] - E_D[f_D(X)]^2 \qquad (4)$$

Comparing eq. 3 and 4 we see that the higher the instability of the base
classifiers, the higher their variance. In other words the reduction of the
error in random aggregating is due to the reduction of the variance component
(eq. 4) of the error, and hence if $V(X)$ is quite large we may expect a
considerable reduction of the error.
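
The relationship between the expected losses and the variance can be illustrated numerically. The sketch below is not part of the experiments reported here: it assumes a noise-free sine target and simulates the base predictors as the target plus Gaussian perturbations, and all names are hypothetical. With the expectations replaced by sample means over the simulated predictors, the gap between the two losses equals the average variance of the base predictor.

```python
import numpy as np


def expected_losses(preds, targets):
    """Sample estimates of EL (eq. 1), EL_A (eq. 2) and the mean variance (eq. 4).

    preds:   array of shape (n_sets, n_points); row k holds f_{D_k}(x) at each point
    targets: array of shape (n_points,) with the target values
    """
    el = np.mean((targets[None, :] - preds) ** 2)   # loss of the single predictors
    f_a = preds.mean(axis=0)                        # aggregated predictor
    el_a = np.mean((targets - f_a) ** 2)            # loss of the aggregated predictor
    variance = np.mean(preds.var(axis=0))           # variance averaged over the points
    return el, el_a, variance


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
targets = np.sin(2.0 * np.pi * x)
# Assumption of this sketch: each base predictor is the target plus Gaussian noise.
preds = targets + rng.normal(scale=0.3, size=(200, x.size))

el, el_a, variance = expected_losses(preds, targets)
print(el, el_a, variance)
```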

Breiman showed also that in classification problems, as in regression, ag- 
gregating relatively good predictors can lead to better performances, as long 
as the base predictor is unstable, whereas, unlike regression, aggregating poor 
predictors can lower performances. 

2.2 Bagging 

Bagging is an approximation of random aggregating, for at least two reasons.
First, bootstrap samples are not real data samples: they are drawn from a data
set $D$, that is in turn a sample from the population $U$. On the contrary, $f_A$
uses samples drawn directly from $U$. Second, bootstrap samples are drawn from
$D$ through a uniform probability distribution, which is only an approximation
of the unknown true distribution $P$.

Moreover, with bagging each base learner uses on average only 63.2% of the
available data, and so we may expect a larger bias for each base learner, as
the effective size of the learning set is reduced. This can also affect the bias
of the bagged ensemble, which critically depends on the bias of the component
base learners: we could sometimes expect an increment of the bias of the bagged
ensemble with respect to the unaggregated predictor trained on the entire
available training set.
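
The 63.2% figure follows from a standard argument: an example is left out of a bootstrap sample of size $n$ drawn from $n$ examples with probability $(1 - 1/n)^n$, which tends to $1/e$ as $n$ grows. A minimal Python sketch of this computation (the function name is ours):

```python
import math


def expected_unique_fraction(n):
    """Expected fraction of distinct examples in a bootstrap sample of size n
    drawn with replacement from a data set of n examples."""
    return 1.0 - (1.0 - 1.0 / n) ** n


for n in (10, 100, 1000):
    print(n, expected_unique_fraction(n))
print("limit", 1.0 - math.exp(-1.0))
```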
2.3 Subsampled Bagging

While bagging has been successfully applied to different classification and
regression problems [2,6,7], random aggregating is mostly an ideal procedure,
because in most cases the true distribution $P$ is unknown and often only small
data sets are available.

In order to obtain an approximation of random aggregating, we can approximate
the universe $U$ with a large data set $\mathcal{D}$ from which subsampled data,
that is samples $D \subset \mathcal{D}$ much smaller than $\mathcal{D}$, are
randomly drawn with replacement.

In real problems, if we have very large learning sets, or learning data
available on-line, we could use subsampled bagging in order to overcome the
space complexity problem arising from learning on too large data sets, or to
allow on-line learning [8]. For instance, most implementations of the SVM
learning algorithm have $O(n^2)$ space complexity, where $n$ is the number of
examples. If
n is relatively large (e.g. n = 10®) we need room for 10^^ elements, a too costly 
memory requirement for most current computers, even if SVMs implemented 
through decomposition methods can achieve a certain reduction of space and 
time complexity [9]. As an alternative, we could use relatively small data sets 
randomly drawn from the large available data set, using random subsampling 
and aggregation techniques [8, 10]. 

3 Bias-Variance Decomposition of the Error

Bias-variance decomposition of the error has recently been applied as a tool to
analyze learning algorithms and ensemble methods [4, 5]. The main purpose of
this analysis consists in discovering relationships between properties and param- 
eters of learning algorithms and ensemble methods with respect to their bias- 
variance characteristics, in order to gain insights into their learning behaviour 
and to develop ensemble methods well-tuned to the properties of a specific base 
learner [11]. 

Here we present only a brief overview of the main concepts of Domingos' theory
of the bias-variance decomposition of the error [12] that are necessary to
understand the bias-variance analysis of random aggregating and bagging. For
more details about the application of Domingos' theory to the bias-variance
analysis of ensemble methods see [11].

Expected loss depends on the randomness of the training set and the target.
Let $\mathcal{L}$ be a learning algorithm, and define $f_D = \mathcal{L}(D)$ as
the classifier produced by $\mathcal{L}$ applied to a training set $D$, where $D$
is a set of $m$ points $(x_j, t_j)$, $t_j \in C$, $x_j \in \mathbb{R}^d$,
$d \in \mathbb{N}$, drawn identically and independently from the "universe"
population $U$ according to $P(x, t)$, the joint distribution of the data points
in $U$. The model produces a prediction $f_D(x) = y$. Let $L(t, y)$ be the 0/1
loss function, that is $L(t, y) = 0$ if $y = t$, and $L(t, y) = 1$ otherwise. The
expected loss $EL$ of a learning algorithm $\mathcal{L}$ at point $x$ can be
written by considering both the randomness due to the choice of the training set
$D$ and the randomness in $t$ due to the choice of a particular test point
$(x, t)$: $EL(\mathcal{L}, x) = E_D[E_t[L(t, f_D(x))]]$, where $E_D[\cdot]$ and
$E_t[\cdot]$ indicate the expected value with respect to the distribution of
$D$, and to the distribution of $t$.

Purpose of bias-variance analysis. It consists in decomposing the expected
loss into terms that separate the bias and the variance. To derive this
decomposition, we need to define the optimal prediction and the main prediction:
bias and variance can be defined in terms of these quantities. The optimal
prediction $y_*$ for point $x$ minimizes $E_t[L(t, y)]$:
$y_*(x) = \arg\min_y E_t[L(t, y)]$. The main prediction $y_m$ at point $x$ is
defined as $y_m = \arg\min_{y'} E_D[L(f_D(x), y')]$. For 0/1 loss, the main
prediction is the class predicted most often by the learning algorithm
$\mathcal{L}$ when applied to training sets $D$.

Bias and variance. The bias $B(x)$ is the loss of the main prediction relative
to the optimal prediction: $B(x) = L(y_*, y_m)$. For 0/1 loss, the bias is always
0 or 1. We will say that $\mathcal{L}$ is biased at point $x$, if $B(x) = 1$. The
variance $V(x)$ is the average loss of the predictions relative to the main
prediction: $V(x) = E_D[L(y_m, f_D(x))]$. It captures the extent to which the
various predictions $f_D(x)$ vary depending on $D$.

Unbiased and biased variance. Domingos distinguishes between two opposite
effects of variance on the error with classification problems: in the unbiased
case variance increases the error, while in the biased case variance decreases
the error. As a consequence we can define an unbiased variance, $V_u(x)$, that is
the variance when $B(x) = 0$, and a biased variance, $V_b(x)$, that is the
variance when $B(x) = 1$. Finally we can also define the net variance $V_n(x)$,
to take into account the combined effect of the unbiased and biased variance:
$V_n(x) = V_u(x) - V_b(x)$.

Noise-free decomposition of the 0/1 loss. The decomposition for a single point
$x$ can be generalized to the entire population by defining $E_x[\cdot]$ to be
the expectation with respect to $P(x)$. Then we can define the average bias
$E_x[B(x)]$, the average unbiased variance $E_x[V_u(x)]$, and the average biased
variance $E_x[V_b(x)]$. In the noise-free case, the expected loss over the
entire population is:

$$E_x[EL(\mathcal{L}, x)] = E_x[B(x)] + E_x[V_u(x)] - E_x[V_b(x)].$$
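
The quantities defined in this section can be estimated from a matrix of predictions. The Python sketch below is a minimal illustration, not the software used for the experiments: it assumes a two-class, noise-free problem with labels in {0, 1}, takes the true labels as the optimal predictions, and simulates the classifiers by flipping the optimal label at random. All names are hypothetical.

```python
import numpy as np


def bias_variance_01(preds, optimal):
    """Noise-free bias-variance decomposition of the 0/1 loss for two classes.

    preds:   int array of shape (n_sets, n_points); row k holds the labels
             predicted by the classifier trained on the k-th training set
    optimal: int array of shape (n_points,) with the optimal predictions y_*
    """
    preds = np.asarray(preds)
    optimal = np.asarray(optimal)
    # Main prediction y_m: the class predicted most often (ties go to class 1).
    main = (preds.mean(axis=0) >= 0.5).astype(int)
    bias = (main != optimal).astype(float)           # B(x), always 0 or 1
    variance = (preds != main).mean(axis=0)          # V(x)
    v_unbiased = np.where(bias == 0, variance, 0.0)  # V_u(x)
    v_biased = np.where(bias == 1, variance, 0.0)    # V_b(x)
    error = (preds != optimal).mean()                # average 0/1 loss
    return {
        "error": float(error),
        "bias": float(bias.mean()),
        "unbiased_variance": float(v_unbiased.mean()),
        "biased_variance": float(v_biased.mean()),
        "net_variance": float(v_unbiased.mean() - v_biased.mean()),
    }


rng = np.random.default_rng(0)
optimal = rng.integers(0, 2, size=500)
# Simulated classifiers: each one flips the optimal label with a
# point-dependent probability (an assumption made only for this sketch).
flip_prob = rng.uniform(0.0, 0.8, size=500)
preds = np.where(rng.random((100, 500)) < flip_prob, 1 - optimal, optimal)

result = bias_variance_01(preds, optimal)
print(result)
```

In the two-class case the average error returned by the sketch equals the average bias plus the net variance, in agreement with the decomposition above.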

4 Experimental Bias-Variance Analysis in Bagged
and Random Aggregated Ensembles of SVMs

The main purpose of the experiments consists in understanding the effect of 
bagging and random aggregating on the bias and variance components of the 
error. We used gaussian, polynomial and dot-product Support Vector Machines 
(SVMs) as base learners. 

4.1 Experimental Set Up 

For bagging we draw with replacement from a learning set $\mathcal{D}$ $n = 100$
samples $D_i$ of size $s = 100$, according to a uniform probability distribution.
From each $D_i$, $1 \leq i \leq n$, we generate by bootstrap $m = 60$ replicates
$D_{ij}$, collecting them in $n$ different sets $\bar{D}_i = \{D_{ij}\}_{j=1}^{m}$.
Fig. 1. Comparison of bias-variance decomposition between single Gaussian SVMs
(lines labeled with crosses) and Gaussian SVM ensembles (lines labeled with
triangles), while varying $\sigma$ and for a fixed value of the regularization
parameter ($C = 1$) with the Letter-Two data set: (a) Bagged ensemble (b) Random
aggregated ensemble.



We approximated random aggregating through subsampled bagging. In particular
we draw with replacement from $\mathcal{D}$ $n = 100$ sets of samples
$\bar{D}_i$, according to a uniform probability distribution. Each set of samples
$\bar{D}_i = \{D_{ij}\}_{j=1}^{m}$ is composed of $m = 60$ samples $D_{ij}$ drawn
with replacement from $\mathcal{D}$, using a uniform probability distribution;
each $D_{ij}$ is composed of 100 examples. Note that in this case each sample
$D_{ij}$ is drawn directly from $\mathcal{D}$ and not from the samples
$D_i \subset \mathcal{D}$. In order to evaluate the bias-variance decomposition
of the error for single SVMs, we used 100 samples $D_i$ of size 100.
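
The difference between the two sampling schemes can be sketched in Python with index arrays. This is only an illustration: the size of the large data set (`n_data`) is not specified here and is a hypothetical parameter, as are the function names; small values of n and m are used in the example to keep it light.

```python
import numpy as np


def bagging_replicates(n_data, n, s, m, rng):
    """Bagging set-up: n samples D_i of size s drawn from the large data set,
    then m bootstrap replicates D_ij of size s drawn from each D_i."""
    samples = rng.integers(0, n_data, size=(n, s))   # indices forming each D_i
    picks = rng.integers(0, s, size=(n, m, s))       # positions inside D_i
    rows = np.arange(n)[:, None, None]
    replicates = samples[rows, picks]                # shape (n, m, s)
    return samples, replicates


def subsampled_bagging_samples(n_data, n, s, m, rng):
    """Approximate random aggregating: every D_ij is drawn directly from the
    large data set, not from an intermediate sample D_i."""
    return rng.integers(0, n_data, size=(n, m, s))


rng = np.random.default_rng(0)
samples, replicates = bagging_replicates(n_data=10000, n=5, s=100, m=6, rng=rng)
direct = subsampled_bagging_samples(n_data=10000, n=5, s=100, m=6, rng=rng)
print(replicates.shape, direct.shape)
```

Each bootstrap replicate only contains examples of its own D_i, whereas the directly drawn samples can contain any example of the large data set; this is why the training sets are more variable in the second scheme.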

We computed the error and its decomposition into bias, net-variance, unbiased
and biased variance with respect to a separate large testing set T, comparing
the bias-variance decomposition of the error in random aggregated, bagged and
single SVMs. To perform this computationally intensive analysis (more than 10
million single SVMs were trained and tested) we used a cluster of workstations,
and we developed new classes and specific C++ applications, using the
NEURObjects software library [13] and the SVMlight applications [9].

4.2 Results 

In particular we analyzed the relationships of the components of the error with
the kernels and kernel parameters, using data sets from UCI [14] (Waveform,
Grey-Landsat, Letter-Two, Letter-Two with added noise, Spam, Musk) and the
P2 synthetic data set^1. We achieved a characterization of the bias-variance
decomposition of the error in bagged and random aggregated ensembles that
resembles the one obtained for single SVMs [5] (Fig. 1; for more details see
[11]). The results show that bagging reduces the error with respect to single
SVMs, but not as much as random aggregating. Fig. 2 summarizes the main outcomes
of our experiments: bagged ensembles of SVMs achieved a reduction of the relative
average error from 0 to 20% with a 35% decrement of the net-variance, while
random aggregated ensembles obtained about a 90% net-variance decrement, with a
relative reduction of the error from 20 to 70%. Note that a negative sign in the
relative bias reduction means that bagging with some data sets may increment the
bias with respect to single SVMs (Fig. 2).

^1 The application gensimple for the automatic generation of the P2 data set is
available at ftp://ftp.disi.unige.it/person/ValentiniG/BV/gensimple.

Fig. 2. Comparison of the relative error, bias and unbiased variance reduction
between bagged and single SVMs (lines labeled with triangles), and between random
aggregated and single SVMs (lines labeled with squares). B/S stands for Bagged
versus Single SVMs, and R/S for random aggregated versus Single SVMs. Results
refer to 7 different data sets. (a) Gaussian kernels (b) Polynomial kernels.

These results are not surprising: bagging is a variance reduction method, but
we cannot expect as large a decrement of the variance as in random aggregating.
Indeed with random aggregating the base learners use more variable training sets
drawn from $U$. In this way random aggregating exploits information from the
entire population $U$, while bagging can exploit only the information from a
single data set $D$ drawn from $U$, through bootstrap replicates of the data
drawn from $D$. Note that these results do not show that random aggregating is
"better" than bagging, as they refer to differently sized data sets, but rather
they confirm the theoretical variance reduction properties of random aggregating
and bagging. Moreover they show that with well-tuned and accurate base learners
a very large variance reduction is obtained only with random aggregating
(Fig. 2). Note, however, that in our experiments we used small samples: we could
expect that with larger samples the differences between random aggregating and
bagging will be smaller.

5 Subsampled Bagged Ensembles of SVMs 
for Large Scale Classification Problems 

The bias-variance analysis of random aggregated ensembles shows that the
variance component of the error is strongly reduced, while the bias remains
unchanged or is lowered (Fig. 2). Breiman proposed random subsampling techniques
for classification in large databases, using decision trees as base learners
[8], and these techniques have also been successfully applied in distributed
environments [10].

Table 1. Comparison of the results between single and subsampled bagged SVMs (see
text).

          N. samples   Error     Time      ΔError   Speed-up
S-SVM     100000       0.0023    5474.68   -        -
SB-SVM    10000        0.0032    4089.82   0.39     1.3
SB-SVM    5000         0.0043    855.85    0.87     6.4
SB-SVM    2000         0.0076    124.93    2.30     43.8
SB-SVM    1000         0.0127    38.49     4.52     142.2
SB-SVM    200          0.0358    2.92      14.57    1874.9
SB-SVM    100          0.0539    1.69      22.43    3239.4

These facts suggest applying subsampled bagging to large scale classification
problems, using SVMs as base learners, considering also that the SVM algorithm,
like other learning algorithms, does not scale well when very large samples are
available [9].

To get some insights into this hypothesis, we performed a preliminary
experiment with the synthetic data set P2, using a quite large learning set of
$10^5$ examples, and comparing the results of single and subsampled bagged SVMs
on a separate large testing set. Tab. 1 summarizes the results of the
experiments: S-SVM stands for single Gaussian SVMs trained on the entire
available learning set; SB-SVM stands for Subsampled Bagged SVMs trained on
subsamples of the available training set, whose cardinalities are shown in the
column N. samples. The column Error shows the error on the separate test set
(composed of 100000 examples); the column Time shows the training time in
seconds, using an AMD Athlon 2000+ processor (1.8 GHz) with 512 MB RAM. ΔError
is the relative error increment using an ensemble instead of a single SVM, and
the last column represents the speed-up achieved.
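
The last two columns of Tab. 1 can be recomputed from the Error and Time columns. A minimal Python sketch (the function names are ours), applied to the row with 2000 samples:

```python
def relative_error_increment(err_single, err_ensemble):
    """Relative error increment of the ensemble with respect to the single SVM."""
    return (err_ensemble - err_single) / err_single


def speed_up(time_single, time_ensemble):
    """Ratio between the training time of the single SVM and of the ensemble."""
    return time_single / time_ensemble


# Single SVM on the entire learning set versus SB-SVM with 2000 samples (Tab. 1).
print(round(relative_error_increment(0.0023, 0.0076), 2))
print(round(speed_up(5474.68, 124.93), 1))
```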

The results show that with subsampled bagging we can obtain a considerable
speed-up, but with a slight decrement in accuracy. For instance, we need two
minutes for training ensembles with 1/50 of the available data against one
hour and a half for a single SVM trained on the entire data set (Tab. 1), with
an increment of the error from 0.23% to 0.76% (a small absolute difference, but
a high relative error increment). Note that if accuracy is not the main goal, we
achieve most of the error reduction with about 30 base learners (Fig. 3), and
using a parallel implementation we may expect a further speed-up linear in the
number of the base learners.

These results confirm similar ones obtained in [15]. Moreover the
bias-variance analysis explains the behaviour of subsampled bagging when small
learning sets are used: the increment of the bias due to the reduction of the
size of the samples is partially counterbalanced by the reduction of the
net-variance (data not shown), and on the other hand a substantial speed-up is
obtained.
Fig. 3. Error (Log scale) as a function of the number of base learners (Gaussian
SVM) employed. The different curves refer to ensembles with base learners trained
with different fractions of the learning set.



6 Conclusions 

The extensive experimental analysis of bias-variance decomposition of the error
in random aggregated and bagged ensembles of SVMs shows a consistent reduction
of the net-variance with respect to single SVMs. This behaviour is due
primarily to the unbiased variance reduction, while the bias remains
substantially unchanged.

In our experiments random aggregating always reduces the variance close
to 0, while classical bagging reduces the variance by only about 1/3, confirming
the theoretical results of Breiman, and providing a quantitative estimate of the
variance reduction both in bagged and random aggregated ensembles of SVMs.

Preliminary results show also that subsampled bagging, as an approximation 
of random aggregating, can be in practice applied to classification problems when 
single SVMs cannot comfortably manage very large data sets, if accuracy is not 
the main concern. 
Abstract. In this paper, we report an experimental comparison between two
widely used combination methods, i.e. sum and product rules, in order to
determine the relationship between their performance and classifier diversity.
We focus on the behaviour of the considered combination rules for ensembles of
classifiers with different performance and level of correlation. To this end, a
simulation method is proposed to generate a set of two classifiers providing
measurement outputs, with fixed accuracies and diversity. A measure of distance
is used to estimate the correlation among the pairs of classifiers. Our
experimental results tend to show that with diverse classifiers providing no
more than three solutions, the product rule outperforms the sum rule, whereas
when classifiers provide more solutions, the sum rule becomes more interesting.



1 Introduction 

Combining classifiers is now a common approach for improving the performance of
recognition systems. Diversity among classifiers is considered a desirable
characteristic to achieve this improvement. Though still an ill-defined concept,
diversity has been studied recently in many multiple classifier problems, such as
analysing the theoretical relationships between diversity and classification
error [12], using diversity for identifying the minimal subset of classifiers
that achieves the best prediction accuracy [1, 5], or building ensembles of
diverse classifiers [8]. For this last direction, i.e. construction of diverse
classifiers, several implicit and explicit methods have been investigated.
Duin [3] lists the main implicit approaches to build diverse classifiers. The
principal one is to use different data representations adapted to the
classifiers [7]. Diversity can also be implicitly encouraged by varying the
classifier topology [11] or the learning parameters [6], or by training each
classifier on different parts of the data, which is done for example by
Bagging [2]. On the contrary, the aim of explicit methods is to design a set of
classifiers by asserting the diversity measure in the process of building
ensembles [8, 9]: the advantage of incorporating a diversity measure is to
control a priori the diversity between the classifier outputs in order to
facilitate the analysis of the combination behavior. Although interesting, the
existing works are rather limited since the generated classifier ensembles can
only be used to study the performance of abstract-level combination methods and
are thus not applicable to evaluate the measurement type combination methods.

For measurement type combination methods, the use of independent classifiers is 
considered to be essential to achieve better performance. Accordingly, the literature
contains numerous studies evaluating measurement type combination methods with
independent classifiers, among them the works presented in [7] and [11]. For instance,
the authors in [7] present an
experimental comparison between various combination schemes such as the product 
rule, sum rule, min rule, max rule, median rule and majority voting. They show that 
under the assumption of independent classifiers, the sum rule outperforms other clas- 
sifier combination schemes and is more resilient to estimation errors. Under the same 
assumption, the work presented in [11] compares the mean rule and the product rule 
and confirms that in the case of a two-class problem, the combination rules perform 
the same, whereas in the case of problems involving multiple classes and only with 
poor posterior probability estimates, the mean rule is to be used. The analysis of the 
performance of the measurement type combination methods according to diversity 
has on the contrary received less attention [4, 10, 12]. In [4] for example, the authors 
report a theoretical and experimental comparison between the simple average and the 
weighted average. They show that weighted averaging significantly improves the 
performance of simple averaging only for ensembles of classifiers with highly imbal- 
anced performance and correlations. In [10], the author demonstrates on simulated 
data that product can perform better than sum with independent classifiers. With 
correlated classifiers, sum outperforms product whatever the performance of classifi- 
ers. As we can see from the studies of sum and product rules reported in the literature, 
the experimental results about the performance of the two rules are still conflicting. 
For instance, in [7], it was found that sum outperforms product under the assumption 
of classifier independence, while, in [10], product can perform better than sum. One 
can also note that to the best of our knowledge, there are no reported results for these 
two rules (i.e. product and sum) concerning the relationship between their perform- 
ance and classifier diversity. 
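The fixed rules compared in these studies can be written in a few lines. The sketch below is only an illustration: the function names `combine` and `decide` and the two confidence vectors are hypothetical, and majority voting is left out because it works on labels rather than on confidences.

```python
from math import prod
from statistics import median

# Fixed combination rules applied class by class to the classifiers' confidences.
RULES = {"sum": sum, "product": prod, "min": min, "max": max, "median": median}


def combine(conf_vectors, rule="sum"):
    """Fuse one confidence vector per classifier into a single score vector."""
    fuse = RULES[rule]
    # zip(*conf_vectors) regroups the confidences so that each tuple holds
    # every classifier's confidence for the same class.
    return [fuse(per_class) for per_class in zip(*conf_vectors)]


def decide(conf_vectors, rule="sum"):
    """Return the 0-based index of the class with the highest fused score."""
    scores = combine(conf_vectors, rule)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical confidences of two classifiers for a 3-class problem.
classifier_a = [70.0, 20.0, 10.0]
classifier_b = [10.0, 50.0, 40.0]

for rule in ("sum", "product"):
    print(rule, combine([classifier_a, classifier_b], rule),
          "-> class index", decide([classifier_a, classifier_b], rule))
```

With these two vectors the sum rule favours the first class while the product rule favours the second, because the product penalizes the low confidence that one classifier assigns to the first class.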

In this paper, we report an experimental comparison between sum and product 
rules according to classifier dependency. The aim of our experiments is to investigate 
the behaviour of the above fusion rules for classifier ensembles with different levels of
correlation. To this end, we propose a simulation method for building pairs of correlated
classifiers. The idea behind this method is to use the simulator developed in
[13] to generate classifier outputs according to fixed accuracies and diversity. A
distance measure is used to estimate this diversity between the classifier confidences.
The paper is organized as follows. Section 2 presents the diversity measure used
to generate the classifier outputs. Section 3 describes the principle of the proposed
method. The algorithm for generating two classifiers according to a predefined distance
and fixed accuracies is presented in Section 4. First experimental results are
reported in Section 5 to show the efficiency of the proposed method. Conclusions are
drawn in Section 6.



2 Measuring the Diversity 

In a recent work [13], we have proposed a classifier simulator for the evaluation of 
measurement type combination methods. Though this simulator was able to generate 
randomly measurement type outputs (i.e. a confidence value associated to each solu- 
tion proposed by the classifier), it should be noted however that this classifier simula- 
tor was only able to create independent classifiers. We have therefore investigated the 
generation of correlated classifiers. Building such ensembles of correlated classifiers 
needs to use an appropriate measure to control the diversity between the classifier 
confidences. Distance, correlation or mutual information are possible measures that 
can be used to estimate the diversity between confidences [1]. Although used in many 
classification problems, there is no analytic proof that a particular measure is prefer- 
able to another. 

In this work, we use the distance to estimate the diversity between two classifiers.
Precisely, consider two classifiers, A and B, providing S outputs for an N-class
classification problem. Let $s_i^A$ and $s_i^B$ (i = 1 to S) be their outputs, each of which is
composed of up to K solutions with K > 1 and K < N. An example of output lists is given
in Figure 1 (the generated confidences are normalized).



Lists of solutions for the S outputs (proposed label [confidence]):

    2 [79.88]   1 [20.12]
    1 [69.06]   3 [30.61]   2 [0.33]
    2 [100.00]
    1 [51.83]   3 [48.17]
    1 [100.00]
    1 [63.08]   2 [36.40]   3 [0.52]



Fig. 1. Description of the classifier outputs. 



Each output $s_i^A$ (respectively $s_i^B$) can be represented as a confidence vector
$(c_{i1}^A, \ldots, c_{iN}^A)$, where $c_{ij}^A$ is the confidence associated with class j. As we
assume that the confidences are normalized, we have

$$\sum_{j=1}^{N} c_{ij}^A = 100 \quad \text{with} \quad c_{ij}^A \in [0, 100].$$

The diversity between the two classifiers A and B providing this type of outputs
is defined as:

$$D = \frac{1}{S} \sum_{i=1}^{S} d(s_i^A, s_i^B) \qquad (1)$$

where $d(s_i^A, s_i^B)$ is the Hamming distance between the outputs $s_i^A$ and $s_i^B$, given by:

$$d(s_i^A, s_i^B) = \frac{1}{200} \sum_{j=1}^{N} \left| c_{ij}^A - c_{ij}^B \right| \qquad (2)$$

200 is the maximal distance between two confidence vectors, so the distance
$d(s_i^A, s_i^B)$ is normalized and varies between 0 and 1. $d(s_i^A, s_i^B) = 0$ means that the
outputs $s_i^A$ and $s_i^B$ contain the same solutions with the same confidences. Conversely,
$d(s_i^A, s_i^B) = 1$ indicates that the two outputs are totally different (there is no
intersection of class labels, as shown in the following example for a 5-class problem):



Classifier A                  Classifier B
1: 1 [75.34] 2 [24.66]        1: 3 [60.00] 4 [40.00]
1: 1 [100.00]                 1: 2 [50.00] 5 [50.00]
1: 2 [100.00]                 1: 4 [100.00]
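The distance can be checked numerically on the three pairs of outputs above. The sketch below is a minimal illustration: the helper names are hypothetical, solution lists are written as (label, confidence) pairs with 1-based labels, and taking the classifier-level diversity as the mean of the per-output distances is an assumption of this sketch.

```python
def to_vector(solutions, n_classes):
    """Expand a solution list [(label, confidence), ...] into a dense
    confidence vector indexed by class (labels are 1-based)."""
    vector = [0.0] * n_classes
    for label, confidence in solutions:
        vector[label - 1] = confidence
    return vector


def output_distance(sol_a, sol_b, n_classes):
    """Normalized distance between two outputs: sum of absolute confidence
    differences divided by 200, the largest possible value of that sum."""
    vec_a = to_vector(sol_a, n_classes)
    vec_b = to_vector(sol_b, n_classes)
    return sum(abs(x - y) for x, y in zip(vec_a, vec_b)) / 200.0


def diversity(outputs_a, outputs_b, n_classes):
    """Assumption of this sketch: diversity is the mean per-output distance."""
    distances = [output_distance(a, b, n_classes)
                 for a, b in zip(outputs_a, outputs_b)]
    return sum(distances) / len(distances)


# The three output pairs of the 5-class example (no label in common).
outputs_a = [[(1, 75.34), (2, 24.66)], [(1, 100.0)], [(2, 100.0)]]
outputs_b = [[(3, 60.0), (4, 40.0)], [(2, 50.0), (5, 50.0)], [(4, 100.0)]]

print(diversity(outputs_a, outputs_b, n_classes=5))
```

Because no class label is shared within any pair, every per-output distance reaches its maximum and the diversity is 1 up to floating-point rounding.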



We slightly favoured the distance measure for controlling the diversity for the following
reasons: it is a simple measure that does not depend on the classifier accuracy,
and its range is limited to between 0 and 1, representing the extremes of positive and
negative dependency. We thus assume that, for building classifier ensembles, controlling
the diversity by means of a distance is somewhat equivalent to controlling the
correlation between the classifiers. This idea is illustrated in Figure 2, which shows the
relationship between correlation and distance measured for a 5-class problem between 50
pairs of classifiers having the same accuracy p = 0.6.




Fig. 2. Relationship between correlation and distance with 50 pairs of classifiers (where x-axis
stands for the distance and y-axis for the correlation).



3 Principle of the Method

As noted previously, the outputs of the first classifier of the team are generated
by the classifier simulator presented in [13]. In this section, we are interested in building
a second classifier B from the outputs of the first classifier A according to a desired
distance $\delta$.

To illustrate the principle of the method, let us consider, without loss of generality,
the generation of B outputs from A outputs for a 3-class problem. In this case, any
output of classifier A, say $s_i^A = (c_{i1}^A, c_{i2}^A, c_{i3}^A)$, belongs to the plane P defined by

$$\sum_{j=1}^{3} c_{ij} = 100 \quad \text{with} \quad c_{ij} \in [0, 100]$$

as shown in Figure 3 (the plane P is the grey triangle).

Fig. 3. Determination of $s_i^B$ from $s_i^A$ through a barycentric method.

Now, generating an output $s_i^B$ of classifier B at a distance $\delta$ from an output $s_i^A$ of
classifier A comes down to determining the intersection between the plane P and the set
of points $s_i^B$ which satisfy (see Figure 3):

$$\sum_{j=1}^{3} \left| c_{ij}^A - c_{ij}^B \right| = \delta \quad \text{with} \quad \sum_{j=1}^{3} c_{ij}^A = 100, \; \sum_{j=1}^{3} c_{ij}^B = 100; \; c_{ij}^A, c_{ij}^B \in [0, 100] \qquad (3)$$

There is of course an infinity of solutions. However, one solution to this problem
can be simply obtained through a barycentric calculus of $s_i^B$. For that we need another
point, say $s_i^C$, belonging to the plane P, which is quite far from $s_i^A$ (see Figure 3),
so that:

$$\sum_{j=1}^{3} \left| c_{ij}^A - c_{ij}^B \right| = \delta \quad \text{and} \quad \sum_{j=1}^{3} \left| c_{ij}^A - c_{ij}^C \right| = \delta_{\max} \qquad (4)$$

It follows from constraints (4) that:

$$\frac{c_{ij}^B - c_{ij}^A}{\delta} = \frac{c_{ij}^C - c_{ij}^A}{\delta_{\max}} \qquad (5)$$

Through algebraic manipulation of equation (5), each confidence of $s_i^B$ can finally be
obtained as:

$$c_{ij}^B = \frac{(\delta_{\max} - \delta)\, c_{ij}^A + \delta\, c_{ij}^C}{\delta_{\max}} \qquad (6)$$
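Read as a weighted mean of the confidences of A and of the distant point C, formula (6) is straightforward to sketch. The code below assumes that reading; the function names are hypothetical, and the inputs are the first output pair of the worked example in Section 4 (A = 1 [100.00], C = 2 [90.00] 3 [10.00], desired distance 0.6 with a maximal distance of 1).

```python
def barycentric_output(conf_a, conf_c, delta, delta_max=1.0):
    """Confidence vector of B as a weighted mean of A and C.

    Only the ratio delta / delta_max matters: 0 reproduces A, 1 reproduces C.
    """
    weight = delta / delta_max
    return [(1.0 - weight) * a + weight * c for a, c in zip(conf_a, conf_c)]


def normalized_distance(u, v):
    """Sum of absolute confidence differences, normalized by 200."""
    return sum(abs(x - y) for x, y in zip(u, v)) / 200.0


conf_a = [100.0, 0.0, 0.0]   # classifier A: 1 [100.00]
conf_c = [0.0, 90.0, 10.0]   # temporary classifier C: 2 [90.00] 3 [10.00]

conf_b = barycentric_output(conf_a, conf_c, delta=0.6)
print([round(c, 2) for c in conf_b])
print(round(normalized_distance(conf_a, conf_b), 2))
```

Since C shares no label with A, its normalized distance to A is 1, so the weight 0.6 carries over directly to the distance between A and B. The result still sums to 100 because it is a convex combination of two vectors that each sum to 100.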



4 The Generation Algorithm 

The aim of the procedure described in this section is to generate automatically, for an
N-class problem, S outputs of two classifiers (namely A and B) according to a fixed
distance between them and accuracies $p_A$ and $p_B$. In this procedure, we first generate
the outputs of classifier A according to the desired accuracy $p_A$, and we use these
outputs to generate the outputs of classifier B through the method described in Section 3.
The generation of classifier A outputs is beyond the scope of this paper and has been
discussed earlier in a separate paper [13]. The idea proposed here is first to generate the
outputs of classifier B according to the fixed distance and then to modify these outputs
to respect the desired accuracy $p_B$. This consists in generating temporary classifier
outputs which are different from those of classifier A (i.e. at a distance of 1 from
them) and using them in formula (6) to determine the confidences of the outputs of B.
This process, however, tends to generate more than K solutions to fit the desired distance.
Therefore, in order to respect a desired accuracy in the Top K solutions of classifier
B, outputs of more than K solutions are truncated to K solutions and their confidences
are redistributed among the remaining classes.

The final step of the generation procedure is to respect the desired accuracy $p_B$, i.e.
to obtain $p_B \cdot S$ true class labels in the outputs of classifier B. Let us recall that the
accuracy $p_B$ is the ratio of the number of true classes that appear in the top K solutions
of the output lists to the total number S of outputs. Respecting only this accuracy is
easy. For example, to increase the number of true class labels, it is sufficient
to search for outputs that do not contain the true class and replace one of their
class labels with the true class. However, respecting the accuracy and the desired
distance at the same time is not straightforward.
The distance measure used in this work is based on the confidence values associated
with the class labels of the two classifier outputs $s_i^A$ and $s_i^B$ (i = 1 to S). Therefore,
modifying the class labels needs to take into account the confidence values of the two
classifiers.

To illustrate this procedure, suppose that we have generated two outputs for classifier
A and that we want to determine the outputs of classifier B according to a desired
distance of 0.6 for a 3-class problem with K = 2 and $p_B$ = 100%. The temporary outputs
(classifier C) are first generated and then used through equation (6) to derive the
outputs of classifier B, for example:

Classifier A      Classifier C                Classifier B
1: 1 [100.00]     1: 2 [90.00] 3 [10.00]      1: 2 [54.00] 1 [40.00] 3 [6.00]
1: 2 [100.00]     1: 3 [100.00]               1: 3 [60.00] 2 [40.00]

As noted above, the barycentric method tends to generate more than K solutions for
classifier B (as is the case for the first output). To respect K, the last solution of
the first output of B (i.e. 3 [6.00]) therefore has to be eliminated. One way to do
that is to add the confidence of this solution to the first one (i.e. 2 [54.00]) in order to
respect the predefined distance. We do not add the confidence to the second solution
because the class label “1” exists in the output of classifier A. After this elimination,
we obtain:



Classifier A      Classifier B
1: 1 [100.00]     1: 2 [60.00] 1 [40.00]
1: 2 [100.00]     1: 3 [60.00] 2 [40.00]

Now, to respect the accuracy $p_B$, we should add one true class to the second output. In
this case, it is possible to choose only the first solution because the second one has a
class label that exists in the output of A. This leads finally to the following outputs,
which respect both the desired distance and the fixed accuracy of classifier B:

Classifier A      Classifier B
1: 1 [100.00]     1: 2 [60.00] 1 [40.00]
1: 2 [100.00]     1: 1 [60.00] 2 [40.00]
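The two adjustments of this worked example (truncation to K solutions, then insertion of the true class) can be sketched as follows. The helpers `truncate_to_k` and `insert_true_class` are hypothetical names, and the selection policy (first eligible solution) is an assumption of this sketch. Both steps preserve the distance only when the labels involved are absent from the output of A, which holds in the example.

```python
def truncate_to_k(sol_b, labels_in_a, k):
    """Keep the k most confident solutions of B and give each dropped
    confidence to the first kept label that is absent from A's output."""
    ranked = sorted(sol_b, key=lambda s: s[1], reverse=True)
    kept = [list(s) for s in ranked[:k]]
    for _, confidence in ranked[k:]:
        # Moving confidence between two labels that are both absent from A
        # leaves the sum of absolute differences with A unchanged.
        target = next((s for s in kept if s[0] not in labels_in_a), None)
        if target is None:
            raise ValueError("no kept label is absent from A's output")
        target[1] += confidence
    return [tuple(s) for s in kept]


def insert_true_class(sol_b, labels_in_a, true_label):
    """Relabel the first solution whose label is absent from A's output.
    The distance is preserved only if true_label is absent from A as well."""
    if any(label == true_label for label, _ in sol_b):
        return list(sol_b)
    for index, (label, confidence) in enumerate(sol_b):
        if label not in labels_in_a:
            return sol_b[:index] + [(true_label, confidence)] + sol_b[index + 1:]
    raise ValueError("every label of B also appears in A's output")


def distance(sol_a, sol_b, n_classes=3):
    """Normalized distance between two solution lists (1-based labels)."""
    vec_a, vec_b = [0.0] * n_classes, [0.0] * n_classes
    for label, confidence in sol_a:
        vec_a[label - 1] = confidence
    for label, confidence in sol_b:
        vec_b[label - 1] = confidence
    return sum(abs(x - y) for x, y in zip(vec_a, vec_b)) / 200.0


# First output: truncate 2 [54.00] 1 [40.00] 3 [6.00] to K = 2 (A holds label 1).
first_b = truncate_to_k([(2, 54.0), (1, 40.0), (3, 6.0)], {1}, k=2)
# Second output: add the true class 1 to 3 [60.00] 2 [40.00] (A holds label 2).
second_b = insert_true_class([(3, 60.0), (2, 40.0)], {2}, true_label=1)
print(first_b, second_b)
```

Both adjusted outputs remain at a distance of 0.6 from the corresponding outputs of A, which can be confirmed with the `distance` helper.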

The algorithm for the generation of classifier B outputs from classifier A is presented 
in Table 1. 



Table 1. Generation of output lists of classifier B from classifier A.

Inputs:  S (number of outputs), $s_i^A$ (outputs of classifier A),
         $p_B$ (the desired accuracy of classifier B),
         D (the fixed distance between A and B)

Outputs: $s_i^B$, outputs of classifier B (i = 1 to S)

Begin
    For each output i = 1 to S Do
        Generate an output $s_i^C$ different from $s_i^A$
        Determine the output $s_i^B$ of classifier B using (6)
    Compute the accuracy p of classifier B
    If (p < $p_B$) Then
        Select among classifier B outputs |p - $p_B$|*S outputs which do not
        contain a true class label $T_i$
    Else
        Select among classifier B outputs |p - $p_B$|*S outputs which contain
        a true class label $T_i$
    For each selected output $s_i^B$
        If $T_i \notin s_i^B$ Then select a class label that $\notin s_i^A$ and replace it by $T_i$
        Else select a class label that $\in s_i^A$ and replace it by $T_i$
End



5 Experimental Results

In this section, we investigate the performance of the sum and product rules according to
diversity in a two-classifier case for a 10-class problem. For this experiment, we
simulate 50 pairs of classifiers for each of several values of diversity D. The classifiers
provide 1000 outputs according to the same accuracies $p_A = p_B$ = 50%, 70% or
90% with K = 3 (Top 3 accuracy). The results are evaluated according to the maximal
accuracy obtained by the combination over 50 runs. In the first experiment, we show
that, whatever the value of p, the performance of the two rules increases as the diversity
between the two classifiers increases (see Figure 4).



[Panels: Sum; Product]



Fig. 4. The performance of sum and product rules vs. diversity for different values of p (where 
the X-axis stands for the distance between the classifiers and Y-axis stands for the recognition 
rates of the combination rules). 



In the second experiment, we examine the effect of the number K of solutions on the
performance of the sum and product rules for different degrees of classifier diversity. The
experimental conditions are the same as in the first experiment (10-class problem,
two classifiers, 50 pairs of classifiers for each of several values of diversity D), but
here we compare the two rules when the length of the solution list increases (K = 3 or
5) with $p_A = p_B$ = 90%. The accuracy of the combination methods versus the diversity
measure is depicted in Figure 5. The results show that, for dependent classifiers,
there is no difference between the two combination rules whatever the
number of solutions provided (three or five). As the diversity between the classifiers
increases, however, a difference between the sum and product rules appears. When the
classifiers are different (D > 0.6) and provide no more than three solutions, the product
rule achieves a significant improvement over the sum rule. However, when the classifiers
provide more solutions (five), the sum rule becomes preferable. This would
mean that the product rule is more sensitive to the number of solutions and less efficient,
especially for diverse classifiers [7]. These preliminary results need to
be confirmed by a more intensive generation of pairs of classifiers (the current 50 runs
are clearly not sufficient), but they show that our method can be a useful tool to
better understand the effect of classifier diversity on the behavior of measurement-type
combination rules.



[Panels: p (Top3) = 90%; p (Top5) = 90%]

Fig. 5. Comparison of sum and product rules vs diversity for different values of K (where the
X-axis stands for the distance between the classifiers and Y-axis stands for the recognition rates
of the combination rules).



6 Conclusions 

In this paper, we have proposed a new simulation method for the generation of de- 
pendent classifier outputs. An algorithm for building two classifiers with specified 
accuracies and a distance measure between them is presented. Given the outputs of 
the first classifier, the proposed method relies on a two-step procedure which first 
generates the outputs of the second classifier according to the specified distance and
then modifies these outputs in order to respect the predefined accuracy (recognition
rate). By a simple example, we have shown that the proposed method can help to 
clarify (in the case of two classifiers) the conditions under which the measurement 
combination methods can be used according to diversity. As future work, we plan to
extend the proposed method to generate more than two classifiers and control all the
between-pair distances.



Abstract. The Error Correcting Output Codes (ECOC) framework provides a
powerful and popular method for solving multiclass problems using a multitude
of binary classifiers. We recently introduced [10] the Binary Hierarchical
Classifier (BHC) architecture that addresses multiclass classification problems 
using a set of binary classifiers organized in the form of a hierarchy. Unlike 
ECOCs, the BHC groups classes according to their natural affinities in order to 
make each binary problem easier. However, it cannot exploit the powerful error 
correcting properties of an ECOC ensemble, which can provide good results 
even when the individual classifiers are weak. In this paper, we provide an em- 
pirical comparison of these two approaches on a variety of datasets, using well- 
tuned SVMs as the base classifiers. The results show that while there is no clear 
advantage to either technique in terms of classification accuracy, BHCs typi- 
cally achieve this performance using fewer classifiers, and have the added ad- 
vantage of automatically generating a hierarchy of classes. Such hierarchies of- 
ten provide a valuable tool for extracting domain knowledge, and achieve better 
results when coarser granularity of the output space is acceptable. 



1 Introduction 

Classification techniques such as the k-nearest neighbors and multi-layered percep- 
tron can directly deal with multiclass problems. However, in difficult pattern recogni- 
tion problems involving a large number of classes, it has often been observed that 
obtaining a classifier that discriminates between two (groups of) classes is much eas- 
ier than one that simultaneously distinguishes among all classes. This observation has 
prompted substantial research on using a collection of binary classifiers to address 
multiclass problems. Further interest in this area has been fuelled by the popularity of 
classifiers such as SVMs [1] whose native formulation is for binary classification 
problems. While several extensions to multiclass SVMs have been proposed, a careful 
study in [2] showed that none of these approaches are superior to using a set of binary 
SVMs in an “All Pairs” framework. 

The One-Vs-All method [3], the Round Robin Method [4] or the All Pairs method 
[5] (also known as pairwise method), and the Error Correcting approaches [6] [7] [8] 
are some of the techniques proposed for solving multiclass problems by decomposing
the output space. All of these methods can be unified under a common framework
wherein the output space is represented by a binary code matrix, which depends on
the technique being used [7] [9]. The above techniques can also be considered as
two-level approaches in which the second level provides some relatively simple
mechanism to output the final class label based on the decisions obtained from a set of
binary “base” classifiers at the first level. Another common characteristic of these
methods is that they do not take into consideration the affinities among the individual 
classes. As a result, some of the groupings might result in complex decision bounda- 
ries. In particular, this suggests the need to use powerful base classifiers in the ECOC 
framework (decision stumps will not do!). The performance of ECOC methods also 
hinges heavily on the choice of the code matrix, and though multiple solutions have 
been proposed, none of them are guaranteed to produce the optimal matrix for a prob- 
lem at hand. 

The BHC [10] is a multiclassifier system that was primarily developed to deal with 
multiclass hyperspectral data. Like the multiclassifier systems mentioned above, the 
BHC decomposes a C class problem into (C-1) binary meta-class problems. However, 
the grouping of the individual classes into meta-classes is determined by the class 
distributions. A recent work [11], that compared the performance of BHC against that 
of an ECOC system, while using “base” Bayesian classifiers in both systems, showed 
the superior performance of the BHC method on real-world hyperspectral data. Since 
the BHC not only yields valuable domain knowledge, but also resolves the problem of 
having to come up with an optimal code matrix, we decided to evaluate the perform- 
ance of the BHC against the ECOC methods on a wide range of standard datasets. 
SVM, being a popular and powerful binary classifier, was used as the “base” classifier
in both systems. Our experiments show that the performance of the BHC is comparable
to that of ECOC classifiers while remaining robust for small training sets. We
also show that, besides using far fewer classifiers in most cases, the BHC produces binary
trees that are consistent with those a human expert might have constructed
when given just the class labels.



2 Background 

2.1 Binary Hierarchical Classifier 

The Binary Hierarchical Classifier (BHC) [10] involves recursively decomposing a 
multiclass (C-classes) problem into (C-1) two meta-class problems, resulting in (C-1) 
classifiers arranged as a binary tree. The given set of classes is first partitioned into 
two disjoint meta-classes and each meta-class thus obtained is partitioned recursively 
until it contains only one of the original classes. The number of leaf nodes in the tree 
is thus equal to the number of classes in the output space. The partitioning of a parent 
set of classes into two meta-classes is not arbitrary, but is obtained through a determi- 
nistic annealing process, which encourages similar classes to remain in the same parti- 
tion [12]. Thus, as a direct consequence of the BHC algorithm, classes that are similar 
to each other in the input feature space are lumped into the same meta-class higher up 
in the tree. Interested readers are referred to [10] for details of the algorithm. Also, 
note that the BHC is an example of a coarse-to-fine strategy, which has seen several 
application-specific successes and for which solid theoretical underpinnings are
beginning to emerge [13].

The BHC algorithm offers a lot of flexibility in designing the classifier system. For
instance, one can replace the Bayesian classifier at the internal nodes of the generated
BHC tree with stronger binary classifiers such as SVMs [1]. Moreover, different feature
selection methods, specific to the domain of the input data, can be used at each node.
Note that the maximum number of classifiers required in a BHC system is
(C-1), unlike other methods such as All-Pairs in which the number of classifiers is
quadratic in the cardinality of the output space. Further, since the BHC essentially
clusters the classes, it yields additional information about the inherent relationships
among the different classes. Such information can be leveraged in creating classifiers
with varying levels of granularity, extracting simple rules and identifying those features
that help distinguish between groups of classes.
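The structural claims above (C leaves, C-1 internal binary classifiers, classification by descending the tree) can be illustrated with a small sketch. It is not the BHC algorithm: the `halve` partition is a placeholder standing in for the deterministic annealing step of [10], and all names are hypothetical.

```python
class Node:
    """A node of the class hierarchy; internal nodes would hold a binary classifier."""

    def __init__(self, classes, left=None, right=None):
        self.classes = classes
        self.left = left
        self.right = right

    @property
    def is_leaf(self):
        return self.left is None


def build_tree(classes, split):
    """Recursively partition a set of classes until each leaf holds one class."""
    if len(classes) == 1:
        return Node(classes)
    left, right = split(classes)
    return Node(classes, build_tree(left, split), build_tree(right, split))


def halve(classes):
    """Placeholder partition: the BHC would group similar classes instead."""
    middle = len(classes) // 2
    return classes[:middle], classes[middle:]


def count_nodes(node):
    """Return (number of internal nodes, number of leaves)."""
    if node.is_leaf:
        return 0, 1
    left_internal, left_leaves = count_nodes(node.left)
    right_internal, right_leaves = count_nodes(node.right)
    return left_internal + right_internal + 1, left_leaves + right_leaves


def classify(node, sample, goes_left):
    """Descend the tree; goes_left(node, sample) plays the role of the
    binary classifier stored at an internal node."""
    while not node.is_leaf:
        node = node.left if goes_left(node, sample) else node.right
    return node.classes[0]


tree = build_tree(list("ABCDEFG"), halve)
print(count_nodes(tree))
```

For the seven classes used here the tree has six internal nodes and seven leaves, whatever partition function is supplied, because every split of a binary tree adds exactly one internal node.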

2.2 ECOC Classifiers 

The ECOC method of combining binary classifiers is based on the framework used in
[14] for the NETtalk system. In this method, the training data of each class is associated
with a unique binary string of length n. Each bit position in the binary
string represents a particular independent feature, the presence or absence of which
helps differentiate one class from another. During training, n binary functions are
learned, one for each bit position. New data samples are then classified by each of the
n classifiers, whose combined output gives rise to an n-bit binary string. The bit string
thus obtained is then compared with the representative bit strings of each class, and
the class whose bit string is closest to that of the data sample is assumed
to have generated it.

This idea was extended in [6] where, instead of generating bit strings that in some
sense represent the input space, a matrix of code vectors having good error correcting
properties was used. Thus, if d is the minimum Hamming distance between any two code
vectors in the matrix, then the code can correct up to
$\lfloor (d-1)/2 \rfloor$ errors. Hence, if we train a set of classifiers, one for each bit position, then we
can obtain the right class label even when there are $\lfloor (d-1)/2 \rfloor$ errors. These
error-correcting codes will be successful only when the errors made by the individual
classifiers are independent [15]. The ECOC matrices used in [6] were predefined and did
not take into account any factor inherent in the data other than the number of
classes.
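The error-correcting property can be demonstrated on a small example. The 4-class, 7-bit code matrix below is hypothetical and chosen only for illustration; its rows are pairwise at Hamming distance 4, so nearest-codeword decoding tolerates one wrong binary classifier.

```python
from itertools import combinations

# Hypothetical code matrix: one row (codeword) per class, one column per
# binary classifier.
CODE = [
    (1, 1, 1, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1),
    (0, 0, 1, 1, 0, 0, 1),
    (0, 1, 0, 1, 0, 1, 0),
]


def hamming(u, v):
    """Number of positions in which two bit strings differ."""
    return sum(a != b for a, b in zip(u, v))


def min_row_distance(code):
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(u, v) for u, v in combinations(code, 2))


def decode(bits, code):
    """Return the index of the class whose codeword is closest to bits."""
    return min(range(len(code)), key=lambda c: hamming(bits, code[c]))


d = min_row_distance(CODE)
print("minimum distance:", d, "correctable errors:", (d - 1) // 2)

# Flip every single bit of every codeword and check that decoding recovers
# the original class.
for true_class, codeword in enumerate(CODE):
    for position in range(len(codeword)):
        corrupted = list(codeword)
        corrupted[position] ^= 1
        assert decode(corrupted, CODE) == true_class
```

A corrupted string lies at distance 1 from its own codeword and at distance at least 3 from every other one, which is why a single error is always corrected here.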

An error-correcting matrix is considered to be a good one if the codewords are
well separated in Hamming distance and if the columns of the code matrix are
uncorrelated. The latter condition translates into having a large Hamming distance between
pairs of columns, and also between each column and the complements of the rest.
Binary classifiers do not differentiate between learning a particular bit string and its
complement. Hence, even though complementary columns have the maximum Hamming
distance, it is important not to include either identical or complementary columns
in the code matrix. Dietterich et al. [6] provide a simple set of rules to form the
code matrices depending on the cardinality k of the output space:

• Exhaustive codes of length $2^{k-1} - 1$ when k <= 7.

• Column selection from exhaustive codes when 8 <= k <= 11.

• For k > 11, a method based on randomized hill climbing, or on using BCH [16]
codes to form the code matrices.

The above methods, however, do not guarantee the best possible code matrix for a
given problem and involve some heuristics in choosing a good matrix among the
available choices. Finding the optimal code matrix is still an open research question,
though many alternatives have been suggested. Allwein et al. [7] use the confidence
of predictions and perform decoding based on a loss function instead of the Hamming
distance. The performance of different choices of code matrices was also evaluated.
Some of the code matrices used in [7] are:

• A complete code that unifies One-Vs-All, Error Correcting codes and All-Pairs
classification.

• A dense random code with $\lceil 10 \log_2(k) \rceil$ columns for a k-class problem. The code
matrix was selected by generating 10,000 matrices in which each column has a
uniform distribution over {+1, -1} and then choosing the matrix that had the largest
inter-row Hamming distance.

• A sparse random code with $\lceil 15 \log_2(k) \rceil$ columns in which the elements are 0 with
probability 0.5 and 1 or -1 with probability 0.25 each. The code matrix with the
largest Hamming distance was again chosen as the best one.
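Two of the constructions mentioned above, the exhaustive code and the dense random code, can be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: the run-length layout used for the exhaustive code (row 1 all ones, row i made of alternating runs of $2^{k-i}$ zeros and ones), the reading of "largest inter-row Hamming distance" as the largest minimum distance between rows, and the reduced default number of trials (the text reports 10,000).

```python
import math
import random
from itertools import combinations


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def min_row_distance(matrix):
    """Smallest Hamming distance between any two rows."""
    return min(hamming(u, v) for u, v in combinations(matrix, 2))


def exhaustive_code(k):
    """k x (2^(k-1) - 1) binary matrix with no identical or complementary
    columns (layout assumed, see the lead-in)."""
    n_cols = 2 ** (k - 1) - 1
    rows = [[1] * n_cols]
    for i in range(2, k + 1):
        run = 2 ** (k - i)  # length of each run of equal bits in row i
        rows.append([(j // run) % 2 for j in range(n_cols)])
    return rows


def dense_random_code(k, trials=200, seed=0):
    """Draw `trials` random +1/-1 matrices with ceil(10 * log2(k)) columns
    and keep the one whose closest pair of rows is farthest apart."""
    rng = random.Random(seed)
    n_cols = math.ceil(10 * math.log2(k))
    best, best_distance = None, -1
    for _ in range(trials):
        candidate = [[rng.choice((1, -1)) for _ in range(n_cols)]
                     for _ in range(k)]
        distance = min_row_distance(candidate)
        if distance > best_distance:
            best, best_distance = candidate, distance
    return best


for row in exhaustive_code(4):
    print(row)
code = dense_random_code(5, trials=50)
print(len(code), "rows,", len(code[0]), "columns")
```

The exhaustive code grows exponentially with k, which is consistent with its use being restricted to small numbers of classes, whereas the dense random code grows only logarithmically.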

Crammer et al. [8] attempt to find a problem-dependent code matrix by
learning a matrix with real-valued entries that performs well on the training
data. More sophisticated methods such as recursive ECOC [17] have also been
proposed.

Though there are numerous methods of combining the different binary classifiers
in the ECOC framework, no single method has to date shown superior performance
over the rest in terms of error rates on test data. In fact, a recent evaluation [9]
of the different multiclassifier methods shows comparable performance across the
different classifiers and hence suggests the use of the One-Vs-All method (with very
well tuned and powerful base classifiers) as being the most intuitive and simplest method
of implementing multiclassifier systems.



3 Experiments 

Table 1 summarizes the datasets used in our experimental evaluation. All the datasets 
are publicly available at the UCI Machine Learning Repository [18]. These datasets 
were chosen as they all have numeric attributes and have been widely used to evaluate 
ECOC systems. None of the datasets have any missing values. The training data was 
normalized to have a zero mean and unit variance and the test set was normalized 
with the mean and variance of the training data. We tested the performance of BHC 
with base Bayesian Classifiers, BHC with base SVM classifiers and ECOC with base 
SVM classifiers. 
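The normalization step described above, in which the test set is scaled with statistics estimated on the training set only, can be sketched in a few lines. The function names and the toy data are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_rows):
    """Estimate the per-attribute mean and standard deviation on training data."""
    columns = list(zip(*train_rows))
    means = [fmean(column) for column in columns]
    # A constant attribute has zero deviation; fall back to 1.0 to avoid
    # dividing by zero.
    stds = [pstdev(column) or 1.0 for column in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply previously estimated statistics to any set of samples."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 30.0]]
test = [[5.0, 20.0]]

means, stds = fit_standardizer(train)
print(standardize(train, means, stds))  # zero mean, unit variance per attribute
print(standardize(test, means, stds))   # scaled with the training statistics
```

Reusing the training statistics keeps information about the test set out of the preprocessing, so the test samples are not guaranteed to have zero mean or unit variance after scaling.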

For datasets with independent test sets we used the original train/test split, and for
datasets which do not contain an independent test set we performed five-fold cross
validation. For the BHC with “base” Bayesian Classifiers, the in-built feature extraction
method based on the Fisher discriminant was applied at each of the internal nodes.
This BHC method did not require the setting of any parameter and hence, the training 
set was used as is.

Table 1. Datasets

No.  Name          # Train  # Test     # Classes  # Attributes
1.   Glass             214  5-fold CV          6             9
2.   Yeast            1484  5-fold CV         11             9
3.   Satimage         4435  2000               6            36
4.   Segment          2310  5-fold CV          7            16
5.   Page-Blocks      5473  5-fold CV          5            10
6.   Pendigits        7494  3498              10            16
7.   Optdigits        3823  1797              10            64
8.   Letter          16000  4000              26            16
9.   Vowel             528  462               11            12


The BHC algorithm was then modified to accommodate the SVM classifiers, 
which were implemented in Matlab using the toolbox provided by [19]. Each training 
set was split into a 90% training set and a 10% validation set, which was then used to 
tune the parameters of the SVM. RBF kernels were used since they consistently 
yielded better performance than polynomial kernels. To tune the kernel width, it was 
first set at 1 and then was increased or decreased until there was no improvement on 
the validation set. The kernel width was then set at the best possible value within the 
observed range. For both the BHC-SVM and the ECOC-SVM methods, parameter 
tuning was done individually for each validation set. BHC-SVM also used a simple 
Forward Feature Selection algorithm at each node, as the nodes towards the leaves 
tend to have far less training data than the nodes near the root. Feature selection was 
performed only when the ratio of the number of training samples to the input dimen- 
sionality was less than 5 [11]. 
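The kernel-width search described above (start at 1, then increase or decrease until the validation score stops improving) can be sketched as a simple one-dimensional search. The multiplicative step, the function names and the toy score are assumptions of this sketch; in practice `validation_score` would train an SVM with the given width and return its accuracy on the validation set.

```python
import math


def tune_width(validation_score, start=1.0, factor=2.0, max_steps=20):
    """Search upward, then downward, from `start` and stop in each direction
    as soon as the validation score no longer improves."""
    start_score = validation_score(start)
    best_width, best_score = start, start_score
    for direction in (factor, 1.0 / factor):  # first widen, then narrow
        width, previous = start, start_score
        for _ in range(max_steps):
            width *= direction
            score = validation_score(width)
            if score <= previous:  # no improvement: stop this direction
                break
            previous = score
            if score > best_score:
                best_width, best_score = width, score
    return best_width, best_score


def toy_score(width):
    """Stand-in for a validation accuracy curve that peaks at width 8."""
    return -abs(math.log2(width) - 3.0)


best_width, best_score = tune_width(toy_score)
print(best_width)
```

The search stops at the first non-improving step, so it can miss a better width beyond a local plateau; a finer search within the observed range would be a natural refinement.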

For the ECOC-SVM method, the kernel parameter was tuned as in the previous 
case. The code matrices were generated by: 

• The Exhaustive method in [6] when the number of classes k <= 7.

• The Dense random code method of [7] when 7 < k <= 11.

• The BCH code matrix (obtained from [19]) when k > 11.
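
For the first case, the exhaustive construction can be sketched as below. The layout used here (an all-ones first row, then rows that alternate runs of zeros and ones of halving length) is the commonly described form of the exhaustive code and is assumed, not quoted from [6]; the function name is illustrative. It yields 2^(k-1) - 1 binary classifiers, which agrees with the ECOC classifier counts of 15, 31 and 63 reported for the 5-, 6- and 7-class datasets.

```python
def exhaustive_code_matrix(k):
    """Return a k x (2**(k-1) - 1) matrix of 0/1 codewords, one row per class."""
    if k < 3:
        raise ValueError("this sketch expects k >= 3")
    n_cols = 2 ** (k - 1) - 1
    matrix = []
    for row in range(k):
        if row == 0:
            # The first class receives the all-ones codeword.
            matrix.append([1] * n_cols)
        else:
            # Row `row` alternates runs of 2**(k-1-row) zeros and ones.
            shift = k - 1 - row
            matrix.append([(col >> shift) & 1 for col in range(n_cols)])
    return matrix

code = exhaustive_code_matrix(6)   # 6 classes -> 31 columns (binary problems)
```

Each column defines one two-class problem, so the number of columns is the number of base classifiers that must be trained.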

Table 2 shows the percentage accuracies of the three different classifier schemes. The
number in parentheses accompanying an accuracy is the standard deviation of the
percentage accuracy over the five cross-validation runs. An asterisk indicates that the
result is statistically significantly different from that of the BHC-SVM at the 90%
confidence level.

From Table 2, it can be seen that the use of SVMs as the internal classifiers
improves the performance of the BHC significantly. The BHC-SVM also achieves percentage
accuracies that are comparable to, if not better than, those of the ECOC-SVM in most
cases. Moreover, the BHC uses fewer classifiers in all cases except that of
the Letter dataset, for which a BCH-like code matrix of length 15 was used. It has
been remarked in [6] that BCH codes have some practical drawbacks and that human
intervention is necessary to generate good shortened BCH codes. All other methods of
binary classifier generation except the One-Vs-All method use far more classifiers
than the BHC while showing similar performance on the UCI datasets [4] [6] [7] [9].

As mentioned earlier, the BHC performs a clustering of the output classes based on 
their separability and hence the algorithm tends to generate hierarchies of classes that 
appeal intuitively. A few sample trees for the datasets are included in the Appendix. 

Table 2. Percentage Accuracy across the different datasets

No.  Name         BHC            BHC-SVM       ECOC-SVM      # classifiers (BHC / ECOC)
1.   Glass        59.2 (7.79)    66.59 (6.2)   63.7 (6.14)   6/31
2.   Yeast        43.56* (3.25)  56.91 (3.28)  53.81 (3.7)   10/34
3.   Satimage     83.85*         91.45         91.4          5/31
4.   Segment      43.9* (10.1)   89.94 (1.34)  89.17 (0.90)  6/63
5.   Page-Blocks  82.64* (5.66)  94.5 (2.05)   94.8 (1.9)    4/15
6.   Pendigits    89.30*         97.9          98.45*        9/33
7.   Optdigits    93.9*          96.6          96.99         9/33
8.   Letter       73.7* (1.13)   96.23 (0.60)  96.83 (0.32)  25/15
9.   Vowel        47.4*          61.03         51.29         10/34



The numbers at the internal nodes of the trees represent the meta-classes correspond- 
ing to these nodes. It can be seen that the BHC does indeed generate meaningful class 
hierarchies. For instance in the Satimage tree, the Vegetation/Foliage classes form one 
meta-class while the different soil classes form another. In some cases, it might be 
sufficient to just know whether an incoming data sample is one of vegetation or soil. 
A single classifier at the root node is capable of performing this task whereas in, say, 
the One-Vs.-All method one would have to evaluate six different binary classifiers to 
obtain the same result. Feature selection techniques can also be used to generate sim- 
ple classifiers that can distinguish between the meta-classes and one can identify 
those features that help distinguish classes at a coarser level of granularity. Such 
knowledge extraction is not easily possible in the other multiclassifier systems. 

A strength of ECOC-based classifiers is that they can compensate for weak base
classifiers. Hence, one should also examine results based on smaller amounts of training
data to see whether ECOCs have superior properties in such situations. Figures 1 and 2
show typical BHC-SVM and ECOC-SVM training curves obtained for two of the
datasets. It is noteworthy that the BHC-SVM curves remain comparable to the ECOC-SVM
ones even at very low percentages of training data.



4 Conclusions 

The empirical studies presented in this paper show that the Binary Hierarchical Clas- 
sifier is a very viable approach to multiclass problems. As compared to ECOC, the 
BHC offers the added advantages of not requiring an optimal code matrix, while us- 
ing far fewer classifiers and automatically generating class hierarchies. While inde- 
pendence of the (meta)-classes is desired for success of the ECOC schemes, the BHC 
exploits the very dependence to generate a robust and scalable classifier with several 
beneficial properties. 
Fig. 1. Training curves for the Page-Blocks Dataset




Fig. 2. Training curves for the Letters Dataset 

Appendix

A few of the sample trees obtained using the BHC:

SATIMAGE
[Tree figure. Node labels, top to bottom: Cotton Crop; Soil with Vegetation
Stubble; Red Soil; meta-class 3,4,6; Grey Soil; meta-class 4,6; Damp Grey Soil;
Very Damp Grey Soil.]

GLASS
[Tree figure. Legend: Containers; Headlamps; Building window-float processed
(BW-FP); Vehicle window-float processed (VW-FP); Building window-non-float
processed (BW-NFP). Node labels, top to bottom: Containers, Headlamps,
Tableware | BW-FP, VW-FP, BW-NFP; BW-NFP | VW-FP; VW-FP | BW-FP.]

VOWEL
[Tree figure. Node labels, top to bottom: 1,2,3,4 | 5,6,7,8,9,10,11;
3,4 | 1,2 | 8,9,10 | 5,6,7,11; head, had, heed, hid, who’d | 8,9 | 6,11 | 5,7;
hood, hoard, heard, hud, hard, hod.]
Abstract. One of the potential advantages of multiple classifier systems
is an increased robustness to noise and other imperfections in data.
Previous experiments on classification noise have shown that bagging is 
fairly robust but that boosting is quite sensitive. Decorate is a recently 
introduced ensemble method that constructs diverse committees using 
artificial data. It has been shown to generally outperform both boosting 
and bagging when training data is limited. This paper compares the sen- 
sitivity of bagging, boosting, and Decorate to three types of imperfect 
data: missing features, classification noise, and feature noise. For missing
data, Decorate is the most robust. For classification noise, bagging
and Decorate are both robust, with bagging being slightly better than
Decorate, while boosting is quite sensitive. For feature noise, all of the
ensemble methods increase the resilience of the base classifier.



1 Introduction 

In addition to their many other advantages, multiple-classifier systems hold the 
promise of developing learning methods that are robust in the presence of im- 
perfections in the data; in terms of missing features, and noise in both the class 
labels and the features. Noisy training data tends to increase the variance in 
the results produced by a given classifier; however, by learning a committee of 
hypotheses and combining their decisions, this variance can be reduced. In par- 
ticular, variance-reducing methods such as Bagging [2] have been shown to be 
robust in the presence of fairly high levels of noise, and can even benefit from 
low levels of noise [3]. 

Bagging is a fairly simple ensemble method which is generally outperformed 
by more sophisticated techniques such as AdaBoost [4,13]. However, Ad- 
aBoost has a tendency to overfit when there is significant noise in the training 
data, preventing it from learning an effective ensemble [3]. Therefore, there is 
a need for a general ensemble meta-learner¹ that is at least as accurate as
AdaBoost when there is little or no noise, but is more robust to higher levels of
random error in the training data. 

Decorate [9, 10] is a recently introduced ensemble meta-learner that di- 
rectly constructs diverse committees by employing specially-constructed artificial 

¹ An ensemble meta-learner, like Bagging and AdaBoost, takes an arbitrary base
learner and uses it to build a more effective committee of hypotheses [17].



F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 293—302, 2004. 
training examples. Extensive experiments have demonstrated that Decorate
constructs more accurate diverse ensembles than AdaBoost and Bagging when 
training data is limited, and does at least as well as AdaBoost when the train- 
ing set is relatively large. By using artificial training data to construct diverse 
committees and prevent over-fitting, Decorate has been shown to be a very 
effective ensemble meta-learner on a wide variety of data sets. 

This paper explores the resilience of Decorate to the various forms of im- 
perfections in data. In our experiments, the training data is corrupted with 
missing features, and random errors in the values of both the category and the 
features. Results on a variety of UCI data demonstrate that, in general,
Decorate continues to improve on the accuracy of the base learner, despite the
presence of each of the three forms of imperfections. Furthermore, Decorate 
is clearly more robust to missing features than the other ensemble methods. 



2 The Decorate Algorithm 

This section summarizes the Decorate algorithm; for further details see [9, 10]. 
The approach is motivated by the fact that combining the outputs of multiple 
classifiers is only useful if they disagree on some inputs [6]. We refer to the 
amount of disagreement as the diversity of the ensemble, which we measure 
as the probability that a random ensemble member’s prediction on a random 
example will disagree with the prediction of the complete ensemble. 
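
The diversity measure just defined can be estimated on a finite sample as the fraction of (member, example) pairs on which the member disagrees with the combined prediction. The sketch below is illustrative only: it combines members by majority vote, which is an assumption made here for simplicity, and the function names are not taken from [9, 10].

```python
from collections import Counter

def ensemble_vote(member_labels):
    """Combine member labels by majority vote (ties go to the first label seen)."""
    return Counter(member_labels).most_common(1)[0][0]

def diversity(predictions):
    """predictions[m][i] is the label that member m assigns to example i."""
    n_members = len(predictions)
    n_examples = len(predictions[0])
    disagreements = 0
    for i in range(n_examples):
        votes = [predictions[m][i] for m in range(n_members)]
        combined = ensemble_vote(votes)
        disagreements += sum(1 for v in votes if v != combined)
    return disagreements / (n_members * n_examples)

preds = [["a", "a", "b"],
         ["a", "b", "b"],
         ["a", "b", "a"]]
d = diversity(preds)   # 2 disagreeing pairs out of 9
```

A value of zero means every member always agrees with the ensemble, in which case combining them adds nothing.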

Decorate was designed to use additional artificially-generated training data 
in order to generate highly diverse ensembles. An ensemble is generated itera- 
tively, learning one new classifier at each iteration and adding it to the current 
ensemble. The ensemble is initialized with the classifier trained on the given 
data. The classifiers in each successive iteration are trained on the original data 
and also on some artificial data. In each iteration, a specified number of artificial 
training examples are generated based on a simple model of the data distribu- 
tion. The category labels for these artificially generated training examples are 
chosen so as to differ maximally from the current ensemble’s predictions. We 
refer to this artificial training set as the diversity data. We train a new classifier 
on the union of the original training data and the diversity data. If adding this 
new classifier to the current ensemble increases the ensemble training error, then 
this classifier is rejected; otherwise it is added to the current ensemble. This process
is repeated until the desired committee size is reached or a maximum number of
iterations is exceeded.
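
The iterative construction described in this paragraph can be sketched as follows. This is a simplified reading, not the reference implementation: `train` and `make_diversity_data` are hypothetical callables supplied by the caller, members are combined by majority vote (an assumption), and a candidate is kept whenever the ensemble training error does not increase.

```python
from collections import Counter

def decorate(train, make_diversity_data, data, target_size, max_iterations):
    """Sketch of the Decorate loop.

    train(examples) -> classifier, where classifier(x) returns a label.
    make_diversity_data(ensemble) -> list of (x, label) artificial examples.
    data is a list of (x, label) pairs.
    """
    def vote(ensemble, x):
        return Counter(clf(x) for clf in ensemble).most_common(1)[0][0]

    def training_error(ensemble):
        wrong = sum(1 for x, y in data if vote(ensemble, x) != y)
        return wrong / len(data)

    ensemble = [train(data)]            # initial member: trained on the given data
    error = training_error(ensemble)
    iterations = 1
    while len(ensemble) < target_size and iterations < max_iterations:
        candidate = train(data + make_diversity_data(ensemble))
        trial_error = training_error(ensemble + [candidate])
        if trial_error <= error:        # reject only if the training error increases
            ensemble.append(candidate)
            error = trial_error
        iterations += 1
    return ensemble
```

Because rejected candidates still consume an iteration, the returned ensemble can be smaller than `target_size`.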

The artificial data is constructed by randomly generating examples using an 
approximation of the training data distribution. For numeric attributes, a Gaus- 
sian distribution is determined by estimating the mean and standard deviation 
of the training set. For nominal attributes, the probability of occurrence of each 
distinct value is determined using Laplace estimates from the training data. Ex- 
amples are then generated by randomly picking values for each feature based on 
these distributions, assuming attribute independence. In each iteration, the arti- 
ficially generated examples are labeled based on the current ensemble. Given an 
example, we compute the class membership probabilities predicted by the current
ensemble, replacing zero probabilities with a small ε for smoothing. Labels
are then sampled from this distribution, such that the probability of selecting a
label is inversely proportional to the current ensemble’s predictions.
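
The generation and labelling of the artificial examples can be sketched as below. The sketch makes several assumptions that are not fixed by the text: the population standard deviation is used, the Laplace estimate adds one to each count, the value of `eps` is arbitrary, and all function names are illustrative.

```python
import random
import statistics

def fit_numeric(values):
    """Gaussian parameters (mean, standard deviation) of one numeric attribute."""
    return statistics.mean(values), statistics.pstdev(values)

def fit_nominal(values, domain):
    """Laplace estimate of each value's probability (one added to every count)."""
    total = len(values) + len(domain)
    return {v: (values.count(v) + 1) / total for v in domain}

def generate_example(numeric_params, nominal_params, rng):
    """Draw every attribute independently from its fitted distribution."""
    numeric = [rng.gauss(mu, sigma) for mu, sigma in numeric_params]
    nominal = [rng.choices(list(p), weights=list(p.values()), k=1)[0]
               for p in nominal_params]
    return numeric + nominal

def inverse_label_distribution(class_probs, eps=1e-3):
    """Weight each label by 1/p (zeros replaced by eps), then normalise."""
    inverse = {c: 1.0 / (p if p > 0 else eps) for c, p in class_probs.items()}
    total = sum(inverse.values())
    return {c: w / total for c, w in inverse.items()}

def sample_label(class_probs, rng, eps=1e-3):
    dist = inverse_label_distribution(class_probs, eps)
    labels = list(dist)
    return rng.choices(labels, weights=[dist[c] for c in labels], k=1)[0]

rng = random.Random(0)
colour = fit_nominal(["x", "x", "y"], ["x", "y", "z"])
example = generate_example([fit_numeric([1.0, 2.0, 3.0])], [colour], rng)
label_dist = inverse_label_distribution({"a": 0.8, "b": 0.2})
```

With ensemble probabilities of 0.8 and 0.2, the inverse weighting swaps them, so the label the ensemble considers unlikely is the one most often assigned to the artificial example.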

3 Experimental Evaluation 

3.1 Methodology 

Three sets of experiments were conducted in order to compare the performance
of AdaBoost, Bagging, Decorate, and the base classifier, J48, under varying
amounts of three types of imperfections in the data:

1. Missing features: To introduce N% missing features to a data set of D
   instances, each of which has F features (excluding the class label), we select
   randomly with replacement $N \cdot D \cdot F / 100$ instances and for each of
   them delete the value of a randomly chosen feature. Missing features were
   introduced to both the training and testing sets.

2. Classification noise: To introduce N% classification noise to a data set
   of D instances, we randomly select $N \cdot D / 100$ instances with replacement
   and change their class labels to one of the other values chosen randomly with
   equal probability. Classification noise was introduced only to the training
   set and not to the test set.

3. Feature noise: To introduce N% feature noise to a data set of D instances,
   each of which has F features (excluding the class label), we randomly select
   with replacement $N \cdot D \cdot F / 100$ instances and for each of them we
   change the value of a randomly selected feature. For nominal features, the new
   value is chosen randomly with equal probability from the set of all possible
   values. For numeric features, the new value is generated from a Normal
   distribution defined by the mean and the standard deviation of the given
   feature, which are estimated from the data set. Feature noise was introduced
   to both the training and testing sets.
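
Two of the corruption procedures above can be sketched as follows. The number of draws is an assumption where the source formula is unclear: the sketch takes it to be N·D/100 for classification noise and N·D·F/100 for missing features. Function names and the use of `None` to mark a deleted value are illustrative.

```python
import random

def add_classification_noise(labels, classes, noise_percent, rng):
    """Return a copy of `labels` with N% classification noise."""
    noisy = list(labels)
    n_draws = round(noise_percent * len(noisy) / 100)    # assumed N*D/100
    for _ in range(n_draws):
        i = rng.randrange(len(noisy))                    # drawn with replacement
        others = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(others)                    # uniform over the other labels
    return noisy

def add_missing_features(rows, noise_percent, rng):
    """Return a copy of `rows` with feature values deleted (set to None)."""
    damaged = [list(r) for r in rows]
    n_features = len(damaged[0])
    n_draws = round(noise_percent * len(damaged) * n_features / 100)  # assumed N*D*F/100
    for _ in range(n_draws):
        row = rng.randrange(len(damaged))
        damaged[row][rng.randrange(n_features)] = None
    return damaged

rng = random.Random(0)
labels = ["a", "b", "c", "a", "b", "c", "a", "b", "c", "a"]
noisy = add_classification_noise(labels, ["a", "b", "c"], 40, rng)
rows = [[1.0, 2.0, 3.0] for _ in range(10)]
damaged = add_missing_features(rows, 20, rng)
```

Because instances are drawn with replacement, the same instance can be hit more than once, so the fraction of instances actually altered can be somewhat below N%.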

In each set of experiments, AdaBoost, Bagging, Decorate, and J48 were 
compared on 11 UCI data sets using the Weka implementations of these methods 
[17]. Table 1 presents some statistics about the data sets. The target ensemble 
size of the first three methods was set to 15. In the case of Decorate, this 
size is only an upper bound on the size of the ensemble, and the algorithm 
may terminate with a smaller ensemble if the number of iterations exceeds the 
maximum limit. As in [9], this maximum limit was set to 50 iterations, and the 
number of artificially generated examples was equal to the training set size. 

To ascertain that no ensemble method was being disadvantaged by the small 
ensemble size, we ran additional experiments on some datasets with the ensemble 
size set to 100. The trends of the results are similar to those with ensembles of 

J48 is a Java implementation of C4.5 [12] introduced in [17]. 
size 15. Details of these experiments are omitted here, but can be found in the
extended version of this paper [11]. 

For each set of experiments, the performance of each of the learners was 
evaluated at increasing noise levels from 0% to 40% at 5% intervals using 10 
complete 10-fold cross validations. In each 10-fold cross validation the data is 
partitioned into 10 subsets of equal size and the results are averaged over 10 runs. 
In each run, a distinct subset is used for testing, while the remaining instances 
are provided as training data. 

To compare two learning algorithms across all domains we employ the statis- 
tics used in [16], namely the significant win/draw/loss record and the geometric 
mean error ratio. The win/draw/loss record presents three values, the number 
of data sets for which algorithm A obtained better, equal, or worse performance 
than algorithm B with respect to classification accuracy. A win or loss is only 
counted if the difference in accuracy is determined to be significant at the 0.05 
level by a paired t-test. 

The geometric mean (GM) error ratio is defined as
$\sqrt[n]{\prod_{i=1}^{n} \frac{E_A}{E_B}}$, where $E_A$ and $E_B$ are the mean
errors of algorithm A and B on the same domain and n is the number of domains. If the
geometric mean error ratio is less than one it implies that algorithm A performs 
better than B, and vice versa. We compute error ratios to capture the degree to 
which algorithms out-perform each other in win or loss outcomes. 
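
Both summary statistics can be computed as sketched below. The significance decision is taken as an input (a list of booleans, one per data set) instead of being computed, so no particular t-test implementation is assumed; the function names are illustrative, and all errors are assumed to be strictly positive.

```python
import math

def gm_error_ratio(errors_a, errors_b):
    """Geometric mean of the per-domain ratios E_A / E_B (errors must be > 0)."""
    logs = [math.log(ea / eb) for ea, eb in zip(errors_a, errors_b)]
    return math.exp(sum(logs) / len(logs))

def win_draw_loss(errors_a, errors_b, significant):
    """Count wins, draws and losses of A against B; non-significant differences are draws."""
    wins = draws = losses = 0
    for ea, eb, sig in zip(errors_a, errors_b, significant):
        if sig and ea < eb:
            wins += 1
        elif sig and ea > eb:
            losses += 1
        else:
            draws += 1
    return wins, draws, losses

ratio = gm_error_ratio([0.1, 0.1], [0.2, 0.2])
record = win_draw_loss([0.1, 0.3, 0.2], [0.2, 0.2, 0.2], [True, True, False])
```

Working in log space avoids overflow or underflow when the product runs over many domains.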

Table 1. Summary of Data Sets

                                    Attributes
Name           Cases  Classes  Numeric  Nominal
autos            205        6       15       10
balance-scale    625        3        4        -
breast-w         699        2        9        -
colic            368        2       10       12
credit-a         690        2        6        9
glass            214        6        9        -
heart-c          303        2        8        5
hepatitis        155        2        6       13
iris             150        3        4        -
labor             57        2        8        8
lymph            148        4        -       18



3.2 Results 

In this section, we only present the statistics summarized over all 11 datasets. 
For detailed tables of results, see the extended version of this paper [11]. 

Missing Features: The results of running the algorithms when missing features 
are introduced, are presented in Tables 2-4. Each table compares the accuracy 
of Decorate versus another algorithm for increasing percentages of missing 
features. 

Table 2. Missing Features: Decorate vs J48

Noise Level %    0       5       10      15           20      25      30      35      40
Sig. W/D/L       8/3/0   10/1/0  10/1/0  [illegible]  11/0/0  11/0/0  11/0/0  11/0/0  11/0/0
GM Error Ratio   0.8286  0.7882  0.7877  0.7815       0.7921  0.8039  0.8004  0.8095  0.8047



Table 3. Missing Features: Decorate vs Bagging

Noise Level %    0       5       10      15      20      25      30      35      40
Sig. W/D/L       2/7/2   2/8/1   4/7/0   3/7/1   5/5/1   4/7/0   4/7/0   5/5/1   8/3/0
GM Error Ratio   0.9520  0.9298  0.9201  0.9177  0.9041  0.9083  0.9085  0.9150  0.8882



Table 4. Missing Features: Decorate vs AdaBoost

Noise Level %    0       5       10      15      20      25      30      35      40
Sig. W/D/L       4/4/3   5/4/2   6/4/1   4/6/1   4/7/0   6/5/0   8/3/0   6/5/0   8/3/0
GM Error Ratio   0.9534  0.9382  0.9197  0.9024  0.9109  0.8982  0.8827  0.8968  0.8876



These results demonstrate that Decorate is fairly robust to missing fea- 
tures, consistently beating the base learner, J48, at all noise levels (Table 2). 
In fact, when the amount of missing features is 20% or higher, Decorate produces
statistically significant wins over J48 on all datasets. The amount of error
reduction produced by using Decorate is also considerable, as is shown by the 
mean error ratios. 

For this kind of imperfection in the data, in general, all of the ensemble 
methods produce some increase in accuracy over the base learner. However, the 
improvements brought about by using Decorate are higher than those caused 
by both Bagging and AdaBoost. The amount of error reduction achieved by 
Decorate also increases with greater amounts of missing features; as is clearly 
demonstrated by the GM error ratios. 

Figure 1(a) shows the results on a dataset that clearly demonstrates Deco- 
rate’s superior performance at all levels of missing features. In Figure 1(b), we 
see a dataset on which AdaBoost has the best performance when there are no 
missing features; but with increasing amounts of missing features, both Bagging 
and Decorate outperform it. 

The superior performance of Decorate could be attributed to the fact that 
it adds artificial examples to the training set. These artificial examples do not 
contain any missing features, and are generated based on the distributions of 
features estimated over the visible (non-missing) values. A thorough analysis of 
how using artificial examples can increase robustness to missing features is an 
important subject for future research. 

Classification Noise: The comparisons of each ensemble method with the base
learner in the presence of classification noise are summarized in Tables 5-7. The
tables provide summary statistics, as described above, for each of the noise levels
considered.

Fig. 1. Missing Features. Panels: (a) Iris, (b) Autos; x-axis: Percentage of Missing Features.



The win/draw/loss records indicate that both Bagging and Decorate consistently
outperform the base learner on most of the datasets at almost all noise
levels, demonstrating that both are quite robust to classification noise. In the
range of 10-35% of classification noise, Bagging performs a little better than 
Decorate, as is seen from the error ratios. This is because, occasionally, the 
addition of noise helps Bagging, as was also observed in [3]. 

Unlike Bagging and Decorate, AdaBoost is very sensitive to noise in
classifications. Though AdaBoost significantly outperforms J48 on 7 of the 11
datasets in the absence of noise, its performance degrades rapidly at noise levels
as low as 10%. With 35-40% noise, AdaBoost performs significantly worse
than the base learner on 7 of the datasets. Our results on the performance of
AdaBoost agree with previously published studies [3,1,7]. As pointed out in 
these studies, AdaBoost degrades in performance because it tends to place a 
lot of weight on the noisy examples. 
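
The weight concentration mentioned above can be seen in a single reweighting step. The sketch below uses the standard AdaBoost.M1 update (weights of correctly classified examples are multiplied by ε/(1-ε) and then renormalised); it is an illustration and not code from the cited studies, and it assumes the weighted error lies strictly between 0 and 0.5.

```python
def adaboost_reweight(weights, misclassified):
    """One AdaBoost.M1 reweighting step; assumes 0 < weighted error < 0.5."""
    error = sum(w for w, bad in zip(weights, misclassified) if bad)
    beta = error / (1.0 - error)
    # Correctly classified examples are down-weighted; mistakes keep their weight.
    updated = [w if bad else w * beta for w, bad in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]

# Ten equally weighted examples; only the first (say, a mislabelled one) is missed.
weights = [0.1] * 10
new_weights = adaboost_reweight(weights, [True] + [False] * 9)
```

After one step the single misclassified example carries half of the total weight, so a mislabelled instance that no hypothesis can fit quickly dominates the training distribution.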

Figure 2(a) shows a dataset on which Decorate has a clear advantage over 
other methods, at all levels of noise. Figure 2(b) presents a dataset on which 
Bagging outperforms the other methods at most noise levels. This figure also 
clearly demonstrates how rapidly the accuracy of AdaBoost can drop below 
that of the base learner. These results confirm that, in domains with noise in 
classifications, it is beneficial to use Decorate or Bagging, but detrimental to 
apply AdaBoost. 

Feature Noise: The results of running the algorithms with noise in the features 
are presented in Tables 8-10. Each table compares the accuracy of each ensemble 
method versus J48 for increasing amounts of feature noise. 

In most cases, all ensemble methods improve on the accuracy of the base 
learner, at all levels of feature noise. Bagging performs a little better than 
the other methods, in terms of significant wins according to the win/draw/loss 
record. In general, all systems degrade in performance with added feature noise. 
The drop in accuracy of the ensemble methods seems to mirror that of the base 
learner, as can be seen in Figure 3. The performance of the ensemble methods 
seems to be tied to how well the base learner deals with feature noise. 

Table 5. Class Noise: Decorate vs J48

Noise Level %    0       5       10      15      20      25      30      35      40
Sig. W/D/L       8/3/0   7/4/0   8/3/0   8/1/2   8/1/2   6/3/2   6/3/2   7/2/2   8/1/2
GM Error Ratio   0.8286  0.8398  0.8633  0.8734  0.8809  0.8960  0.9121  0.9229  0.8995



Table 6. Class Noise: Bagging vs J48

Noise Level %    0       5       10      15      20      25      30      35      40
Sig. W/D/L       7/4/0   9/2/0   9/2/0   9/2/0   8/3/0   7/4/0   8/2/1   7/3/1   7/3/1
GM Error Ratio   0.8704  0.8687  0.8526  0.8508  0.8443  0.8719  0.8867  0.8972  0.8995



Table 7. Class Noise: AdaBoost vs J48

Noise Level %    0       5       10      15      20      25      30      35      40
Sig. W/D/L       7/3/1   6/1/4   2/4/5   1/5/5   1/4/6   2/2/7   1/4/6   1/3/7   1/3/7
GM Error Ratio   0.8691  0.9930  1.0984  1.1604  1.2322  1.2242  1.2431  1.2120  1.1989





Fig. 2. Classification Noise. Panels: (a) Labor, (b) Breast-W; x-axis: Percentage of Classification Noise.

4 Related Work 



Several previous studies have focused on exploring the performance of various 
ensemble methods in the presence of noise. A thorough comparison of Bagging, 
AdaBoost, and Randomization (a method for building a committee of decision 
trees, which randomly determine the split at each internal tree node) is presented 
in [3]. This study concludes that while AdaBoost outperforms Bagging and 
Randomization in settings where there is no noise, it performs significantly worse 
when classification noise is introduced. 

Other studies have reached similar conclusions about AdaBoost [1,7], and 
several variations of AdaBoost have been developed to address this issue. For 
example, Kalai and Servedio [5] present a new boosting algorithm and prove 
that it can attain arbitrary accuracy when classification noise is present. An- 
other algorithm, Smooth Boosting, that is proven to tolerate a combination of 

Table 8. Feature Noise: Decorate vs J48

Noise Level %    0       5       10      15      20      25      30      35      40
Sig. W/D/L       8/3/0   7/4/0   7/4/0   8/3/0   7/3/1   7/4/0   6/5/0   7/4/0   6/5/0
GM Error Ratio   0.8286  0.8335  0.8434  0.8329  0.8593  0.8554  0.8690  0.8723  0.8782



Table 9. Feature Noise: Bagging vs. J48

Noise Level %    [only four columns recoverable; their noise levels are not legible]
Sig. W/D/L       10/1/0  10/1/0  8/3/0   10/1/0
GM Error Ratio   0.8496  0.8473  0.8627  0.8634

Table 10. Feature Noise: AdaBoost vs. J48

Noise Level %    0       5       10      15      20      25      30      35      40
Sig. W/D/L       7/3/1   7/2/2   8/2/1   7/3/1   8/1/2   8/1/2   8/2/1   7/4/0   8/2/1
GM Error Ratio   0.8691  0.8449  0.8575  0.8455  0.8463  0.8564  0.8830  0.8900  0.8750





Fig. 3. Feature Noise. Panels: (a) Glass, (b) Iris; x-axis: Percentage of Feature Noise.



classification and feature noise is presented in [14]. McDonald et al. [8] compare 
AdaBoost to two other boosting algorithms - LogitBoost and BrownBoost - 
and conclude that BrownBoost is quite robust to noise. In an earlier study an 
extension to BrownBoost for multi-class problems was presented and shown empirically
to outperform AdaBoost on noisy data [7]. However, BrownBoost’s
drawback is that it requires a time-out parameter to be set, which can be done
only if the user can estimate the level of noise.

5 Future Work 

Noise in training data chiefly contributes to an increase in the error due to 
variance of the base learner; and hence, variance-reduction techniques would 
be ideal to combat such noise. Bagging is a very effective variance reduction 
method; whereas AdaBoost is primarily a bias reduction technique, though
empirically it has been shown to produce some reduction in variance as well [16]. In
general, Decorate also produces significantly more accurate classifiers than 
the base learner. We are currently investigating whether this improvement in 
accuracy is mainly due to a reduction in bias or variance. This should lend some 
more insight into Decorate’s resilience to imperfections in data. 

In our study, all the ensemble methods were used to generate committees of 
size 15. It may be beneficial to generate larger ensembles, so that the difference 
in performance between the systems is more pronounced. 

An interesting avenue for future work would be to compare the performance 
of Decorate and Bagging with the new boosting algorithms mentioned in Sec- 
tion 4. Another interesting subject for future experimentation is testing how the 
ensemble methods discussed in this study compare to noise elimination tech- 
niques such as the ones presented in [15]. 

6 Conclusion 

This paper evaluates the performance of three ensemble methods, Bagging,
AdaBoost and Decorate, in the presence of different kinds of imperfections in
the data. Experiments using J48 as the base learner show that in the case of
missing features Decorate significantly outperforms the other approaches. In
the case of classification noise, both Decorate and Bagging are effective at
decreasing the error of the base learner, whereas AdaBoost degrades rapidly in
performance, often performing worse than J48. In general, Bagging performs the
best at combating high amounts of classification noise. In the presence of noise
in the features, all ensemble methods produce consistent improvements over the
base learner.
Abstract. We utilise the techniques of independent component analysis
and principal component analysis to derive an independent set of gestural
primitives for visual sign-language, employing existing sign linguistics as
a reference point in the feature reduction.

In this way it is possible both to reduce (by several orders of magnitude) 
the requisite quantity of HMM computation involved in word classifi- 
cation, as well as to significantly improve performance through having 
transformed the initial classification problem into one of decision fusion. 
Moreover, the independent and optimally-compact representation of the 
gestural primitives ensures a maximum of classifier diversity prior to 
combination. 



1 Introduction 

The problem of sign language recognition has engendered considerable interest 
within the pattern-recognition/machine-learning community over recent years
[e.g. 1-4], predominantly as a consequence of its unique interweaving of syntactic
and image-processing concerns. Our interest in the problem falls both within 
these terms, as well as within the broader context of cognitive visual systems; 
the attempt, in essence, to ‘reverse engineer’ the human visual system. To this 
end, we are particularly interested in the nature of the feature-set most appro- 
priate to gesture recognition in so far as it reflects the signer’s intention, as 
much as we are concerned with optimising the classification performance of the 
recognition system. It is evident, however, that the two goals are in no way mutually
exclusive: indeed, by encoding within the features’ structure the way in which
sign-language gestures are visually transcribed, we might expect to achieve a
significant improvement both in the dimensional requirements of our pattern-space
and in the diversity and generalisability of the classifiers contained therein.

This design philosophy is then in contrast to the ‘image-processing-led’ approaches
characterised by e.g. [1] and [3], that employ very large sets of well-understood
feature primitives (for instance the quantised position vectors and hand-shape
descriptors of Starner and Pentland [1]). Such methods can, of course, be
extremely effective, and indeed are broadly representative of the way in which
pattern recognition is generally carried out; feature selection and classification
being the ways in which the total body of problem information contained within
the features is made usefully generalisable. However, much of the computational
effort, danger of overclassification and loss of interpretability involved in
this process might be avoided if there exists exploitable prior knowledge of
gestural intention that can be straightforwardly encompassed by the structure of
the features.

Fortunately, we have access to just such information in the field of gesture 
recognition by virtue of the existence of sign language dictionaries, which have 
evolved over a significant number of years as a means both of visually teaching 
new signs to the sign language user, as well as referencing unfamiliar signs from 
sequences of gestural primitives (see for instance [9]). Thus the diagrammatic 
dictionary descriptions represent, in a sense, both the ground truth of gestural 
intention behind signed word sequences, as well as the gesture descriptor’s per- 
ceptual medians, these ‘visemes’ having been progressively developed for ease of 
recognition and compactness of description. Such descriptors are thus inherently 
independent of those issues, such as signing size and speed, that are irrelevant to 
gestural meaning, but which tend, nevertheless, to be intrinsically represented 
by feature-sets deriving from classical motion and image processing techniques. 

It is then the individual and combined characteristics of these fundamental 
or ‘intentional’ components of gesture which we shall seek to capture in the 
following paper, the first aspect of which will be to provide a temporally-ordered 
series of binary descriptors denoting the presence or absence of the respective 
gestural viseme components for each word class. Once these are obtained, we 
will set out to reduce the redundancy inherent to the visual sign descriptions by 
establishing the underlying independent components of description, in effect the 
minimal description of gestural signification in information-theoretic terms. We 
do this via a combination of independent component analysis (ICA) and principal
component analysis (PCA), the use of which is made uniquely possible by the 
partial dependency reduction already inherent in the dictionary descriptors. 

It is hence after the independent temporal channels are established for each 
of the classes that the most important practical consequence of our technique 
makes itself apparent: if we are to go on to represent the temporal class sequences 
by Hidden Markov Models (HMMs), the independence of the feature channels 
implies that we can consider them on an individual basis. That is, rather than 
requiring that a single HMM encompass all possible transitions between observational
states deriving from combinations of binary feature states (totalling 2^n
for n channels), we need only consider n two-state HMMs per class, which, being
independent, have their class likelihood statistics combined via multiplication. 
(Which is to say, the n HMMs require a decision fusion rule equivalent to the 
Product rule: however, we adopt the Sum Rule, and log-likelihood statistics for 
reasons of maximal error robustness; see for instance [12]. In an error-free scenario
these approaches are, of course, equivalent). Clearly, this transformation
of a classification problem into a decision-fusion problem represents a very significant
reduction in computational complexity: of the order of 7 x 10^° ^ 72
computational cycles for typical practical scenarios^. 
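
To give a feel for the scale of this saving, the following sketch counts emission-table entries for the two designs and checks that summing log-likelihoods leads to the same decision as multiplying likelihoods. The sizes (21 binary channels, 4 hidden states) are borrowed from the experimental section; the function names and the toy likelihood values are illustrative assumptions, and the counts are not intended to reproduce the cycle figures quoted in the text.

```python
import math

def joint_emission_entries(n_channels: int, n_hidden: int) -> int:
    """Emission-table size for one HMM whose observable is the joint binary state."""
    return n_hidden * 2 ** n_channels

def independent_emission_entries(n_channels: int, n_hidden: int) -> int:
    """Emission-table size for n_channels separate HMMs with 2 observables each."""
    return n_channels * n_hidden * 2

def product_rule(likelihoods):
    """Combine independent channel likelihoods by multiplication."""
    return math.prod(likelihoods)

def sum_rule_log(likelihoods):
    """Combine the same likelihoods by summing their logarithms."""
    return sum(math.log(p) for p in likelihoods)

# Assumed sizes: 21 binary channels, 4 hidden states per HMM.
joint = joint_emission_entries(21, 4)
separate = independent_emission_entries(21, 4)

# Toy per-channel likelihoods for two hypothetical classes.
class_a = [0.9, 0.8, 0.7]
class_b = [0.6, 0.9, 0.5]
same_decision = (product_rule(class_a) > product_rule(class_b)) == (
    sum_rule_log(class_a) > sum_rule_log(class_b)
)
print(joint, separate, same_decision)
```

Because the logarithm is monotonic, the ranking of classes is unchanged; the log form simply avoids numerical underflow when many channel likelihoods are combined.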

Another immediately beneficial consequence of the application of ICA is that, 
in terms of the consequent orthogonalisation of the individual HMM outputs, 
the classifier diversity significantly improves: an attempt to quantify this effect 
will thus constitute the initial experimental component of this paper. A third 
benefit derives from applying PCA in conjunction with ICA: it becomes possible
to reduce the problem dimensionality through a prior determination of the
actual number of underlying independent factors relevant to classification. In
the outlined experimental scenario, utilising this option will act to reduce the
feature-space dimensionality from 21 to 18, a 17 percent improvement in the 
immediate processing requirements. 

In essence, then, the utilisation of ICA and PCA enables us to make maxi- 
mally efficient use of the feature ‘bandwidth’ for the purposes of classification, 
the resultant features, it is anticipated, corresponding to the basic components 
of gestural signification. These features being independent, we are thus required to re-evaluate
the problem as one of fusion of maximally diverse classifier outputs. 

In terms of organisation, the paper will hence commence with a description 
of the dictionary-derived feature-set, along with a precis of the underlying moti- 
vation for this type of viseme-based approach. Following this is a brief treatment 
of ICA analysis, comparing and contrasting it with its more ubiquitous relation, 
PCA. Thereafter we will outline a simple experiment to quantify the relative util- 
ity of a composite ICA/PCA treatment in terms of the effect on class separation 
in the HMM output space, as well as on the overall classification performance 
in the context of Sum-Rule decision fusion. 



2 Viseme-Based Feature Descriptors:
Utilising a Pre-existing Pictorial Grammar

2.1 Nature of Features 

We are, in the current investigation, concerned specifically with British Sign 
Language (BSL), which is classed by linguists as a topic-comment language (for 
instance, the English query ‘What is your name?’ would be most nearly ren- 
dered in BSL as the syntactic sequence ‘Name: you: what’). To this specifically 
gesture-syntactic language-form there is added a parallel finger-spelling com- 
ponent for exact English transliteration. Our long term goal being to provide 
a machine-translation that captures both of these aspects of the language, we
require that any proposed gestural feature-set must lend itself to treatment in
terms of grammatical context as well as letter-by-letter or word-by-word classification
(it is, for instance, the case in BSL that facial expression can act as a
modifier of hand gesture, or that differing positions of signing relative to the
body can indicate third-person narrative). Thus it is necessary, if we are to retain
this extensibility, to isolate independent ‘gestural meaning’ components as
far as possible, an objective that will ultimately motivate our use of ICA/PCA
to extract the absolutely basic gesture features for classification.

^ Obviously, an effective treatment of the composite binary feature need not be as
complex as indicated: non-ergodicity and vector-quantisation can be exploited to
significantly decrease the computational work-load. However, it would necessarily
still be several orders of magnitude greater than that implied by the independent
approach. Furthermore, since much of the class variability derives from minor asynchronicities
between the feature channels, a fully independent treatment means that
significantly fewer ‘overlap’ transition states need be quantified.

Our approach, then, attempts to describe events via a set of 2-D lexical 
features, a key rationale being that if the features already naturally generalise
about events within the image, then very much less training data is required to
provide accurate classification. In a manner somewhat akin to Volger et al.'s [11]
we thus attempt to describe the constituents of sign at a component level, via 
a decomposition of the various gesture sequences into visemes (the constituent 
visual components of sign). The actual constitution of these visemes is derived 
from linguistical studies of sign and the consequent notation systems they have 
evolved to catalogue sign vocabularies. Volger et al. thus sought to recognise
the 22 component phonemes of American sign-language dictionary notation by 
utilising a distinct HMM for each phoneme in conjunction with a large set of 
training vectors. Our approach, on the other hand, attempts to coarsely describe 
sign events via a similar ‘Ha-Tab-Sig’ notation [9], but differs critically in that no 
prior training is required to determine feature membership, our rule-based body- 
motion descriptors being sufficiently robust to take these as given. Thus, in a 
sense, Volger et al.'s classification output is our feature-set input. The particular
mechanism by which this body-feature description is enacted consists in the 
dynamic fitting of a 2D contour representing the head and shoulders. Specifically, 
the 18 connected points constituting the contour have their local edge strengths 
computed along normals in order to track the head and shoulder movement. From 
the position, scale and orientation of the contour we then estimate approximate 
location and sizes of the key body parts (for instance, the shoulders, hips, chest, 
stomach, forehead, chin etc.). 

We thus have a method of describing the position and motion of a signer’s 
gestures which is largely independent of individual body morphology and cam- 
era set-up. Crucially, it also provides a description that naturally generalises, 
consequently reducing the training requirements. Linguistic evidence further in- 
dicates that sign recognition is primarily performed upon the dominant hand 
(which conveys the majority of information), and consequently we currently discard
the non-dominant hand in order to concatenate the HA, TAB, SIG features
together to produce a 33 dimensional vector describing the viseme component 
of a single frame of video. 

2.2 Intelligibility Considerations 

While this dictionary-based feature-set has evolved towards some criterion of 
optimality through essentially heuristic reasoning acting over a number of years, 
it does not, as we have indicated, necessarily represent the most compact or 
independent set of features (being limited chiefly by physical co-articulation restraints);
hence our motivation for proposing ICA of the gestural time sequence,
independence critically allowing us to utilise separate HMMs for distinct fea- 
ture channels. It does, however, represent a readily comprehensible feature set. 
Of some concern to us, therefore, is the question of what implication ICA has 
for the intelligibility of features so transformed: we should not, in particular, 
wish to gain computational efficiency at the expense of comprehensibility if the 
eventuality might be avoided. 

In assessing just how far intelligibility is conserved, it is of some advantage to 
have commenced with a readily-intelligible feature set prior to ICA transformation:
it would not be straightforward to derive a comprehensible gesture syntax
from (say) a set of vector-quantised Voronoi cell transitions. In having obtained 
an initial feature set that broadly corresponds to the underlying visual compo- 
nents of gestural intention (such that, for instance, we could learn to emulate 
without having to refer to an actual signer), we have in fact nearly optimally iso- 
lated feature components in intelligibility terms. However, because this isolation 
has not been carried out on the basis of mutual independence, there exists an 
inevitable superfluity of description in the viseme-based feature-set, manifesting 
itself as a dependency among the gestural components of the type that ICA 
immediately removes: critically for our intelligibility concerns, however, it would 
do so explicitly in terms of the intelligible components. Thus, to give a purely 
illustrative example, if a gesture-dictionary feature encoding were to contain a 
time sequence of relative (x, y) positions for both right and left hands, and if it 
were to transpire that, throughout the entirety of the dictionary, there existed an 
exact mirror-symmetry between the left and right hand gestures for each word, 
then an ICA+PCA reduction of the original feature space coordinatisation:

(Lefthand_x_position, Lefthand_y_position, Righthand_x_position, Righthand_y_position),

would produce a set of three fully independent coordinates:

(Hand's_Midpoint_x_position, Hand's_Midpoint_y_position, Inter_hand_distance).

Of course, it is by no means guaranteed in general that the ICA+PCA reduction
would be so transparently comprehensible, but very often an intuitive understanding
of the transformed components can be forged by virtue of the intelligibility
inherent in the original component descriptors. Thus our newly discretized
classifier set broadly corresponds to comprehensible aspects of sign, and
consequently, decision-fusion of their outputs is correlated with a visualisable 
distinction between alternative gestural patterns. 
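
The illustrative mirror-symmetry example can be made concrete with a small sketch of the change of coordinates itself. It does not run ICA or PCA; it only shows that, under the assumed symmetry, four hand coordinates carry three degrees of freedom. The reading of "mirror symmetry" as reflection about a vertical axis through the midpoint, the signed inter-hand distance, and the function names are assumptions made for illustration.

```python
def to_reduced(lx, ly, rx, ry):
    """Map mirror-symmetric hand positions to (midpoint_x, midpoint_y, distance)."""
    return ((lx + rx) / 2.0, (ly + ry) / 2.0, rx - lx)

def from_reduced(mx, my, dist):
    """Rebuild the four coordinates, assuming symmetry about a vertical axis."""
    return (mx - dist / 2.0, my, mx + dist / 2.0, my)

# A hypothetical frame: hands level (same y) and placed symmetrically about x = 0.5.
frame = (0.25, 0.75, 0.75, 0.75)
reduced = to_reduced(*frame)
restored = from_reduced(*reduced)
print(reduced, restored)
```

The round trip is lossless only because the frame satisfies the symmetry assumption; for general hand positions the fourth degree of freedom would be discarded.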



3 ICA versus PCA for Gestural Analysis

3.1 Description of ICA and PCA Methodologies 

Principal Component Analysis (PCA) has long been a tool in the pattern recognition
researcher’s toolkit as a means of reducing pattern-space dimensionality
while retaining class separability. It does so by providing a linear transformation
of pattern vectors that maximises the retained variance with respect to the
subspace dimensionality, thereby preserving class discriminant information and 
prioritising (via the eigenvector eigenvalues) the translated feature axes. 

As such, PCA methods will, in general, find truly independent (as opposed 
to decorrelated) feature axes for only a small subset of pattern-vector distribu- 
tions. ICA, in contrast, specifically seeks independent subspace orientations in 
the data without retaining variance information (components essentially being 
normalised). Thus ICA can be considered a factor analysis for multivariate data 
where the mixing system is initially unknown. As a tool for data interpretation, 
ICA is thus much more useful than PCA, and on occasion, capable of greatly 
improving the overall classification performance. 

We shall give a very brief overview of the former process as follows (with more
detailed and general treatments being available in, for instance, [7] and [8]): Let
$\mathbf{t}$ be a two-dimensional binary vector describing the state of the 21 individual
feature channels over the complete temporal range, $t$. We wish to describe this
vector in terms of the fully independent feature-set vector $\mathbf{t}'$, of presently unknown
channel size. The two vector quantities are related by a mixing matrix,
$M$, thus: $\mathbf{t}' = M\mathbf{t}$. The determination of the matrix $M$ is thus the objective
of our calculation, the first stage of which is the decorrelating (or ‘whitening’,
‘sphering’) of the original input space. That is, we shall require a linear transformation
$\mathbf{t}'' = W\mathbf{t}'$ such that the expectation $E\{\mathbf{t}''\mathbf{t}''^{T}\}$ is equal to $I$ ($I$ being the
identity matrix). A simple solution to this constraint exists via the expansion:

$$I = E\{\mathbf{t}''\mathbf{t}''^{T}\} = E\{W\mathbf{t}'[W\mathbf{t}']^{T}\} = E\{W\mathbf{t}'\mathbf{t}'^{T}W^{T}\} = E\{W[\mathbf{t}'\mathbf{t}'^{T}]W^{T}\} \qquad (1)$$

Setting $S = E\{\mathbf{t}'\mathbf{t}'^{T}\}$, we observe that $W = S^{-1/2}$ fulfils the terms of the constraint,
the last term in the equation becoming $S^{-1/2} S S^{-1/2}$ ($= I$). Having found a suitable
W, it only remains to perform a rotation of the whitened pattern-space axes 
such that the non-Gaussianity of the probability distribution of the individual 
variates of the transformed space is maximised (linear mixtures of variates being 
invariably more Gaussian than their components via the central limit theorem) . 
This is usually achieved via an appropriate statistical, information theoretic or 
morphological measure of non-Gaussianity, and any one of a number of algo- 
rithms for finding global maxima/minima. 
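
One common way of constructing such a whitening matrix, offered here as a sketch rather than as the procedure actually followed, is through the eigendecomposition of S. The symbols U and Λ are introduced purely for illustration, and the derivation assumes that S is symmetric with strictly positive eigenvalues.

```latex
\begin{align*}
S &= E\{\mathbf{t}'\mathbf{t}'^{T}\} = U \Lambda U^{T},
   \qquad U^{T}U = I,\quad \Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n),\quad \lambda_k > 0,\\
W &= S^{-1/2} = U \Lambda^{-1/2} U^{T},\\
W S W^{T} &= U \Lambda^{-1/2} U^{T}\, U \Lambda U^{T}\, U \Lambda^{-1/2} U^{T}
           = U \Lambda^{-1/2} \Lambda \Lambda^{-1/2} U^{T} = U U^{T} = I .
\end{align*}
```

Where some eigenvalues are zero or negligible, the corresponding directions would have to be dropped before inversion, which is one place where a PCA-style dimensionality reduction enters naturally.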

3.2 Implication of ICA and PCA for Classification 

The practical distinction between the two component analysis methods can, in 
the context of classification, be rather subtle, and in general the unmodified 
ICA approach is only found to be useful when distinguishing classes with a high 
degree of stochastic independence (for instance, the star/cosmic ray astronom- 
ical distinctions of [10]), or else where classes are composed of distinct sets of 
independent features (for instance, the texture classifications of [5], where the 
array of potential class sub-textures has a visual aspect somewhat akin to two- 
dimensional Fourier components). As a consequence of the normalising prop- 
erties of ICA detailed earlier, subtle feature independences can be very much 
amplified: whether or not this is useful for classification, though, tends to be
situation dependent. 

Thus, a heuristic rule for advocating the implementation of ICA over PCA for 
the purposes of classification, might read ‘employ PCA where class discriminant 
information is contained within small numbers of correlated features, but em- 
ploy ICA when class discriminant information is contained within independent 
features composed of large numbers of correlated components, and the isolation 
of those components is crucial to class interpretability’. Note it is also possible 
to perform PCA prior to ICA where we have reason to suppose the existence of 
fewer significant independent components than the pre-specified dimensionality 
of our feature-space by imposing an eigenvalue ordering on the ICA components. 

It is precisely the latter situation that we expect to present itself in relation 
to our dictionary-derived feature vectors: for instance (and here we very much 
simplify), it might transpire that left and right hand positions are highly corre- 
lated with arm and shoulder positions, while being completely independent of 
each other; or finger positions might be independent of each other, but corre- 
lated with hand positions, and so on. Thus the gestural channels through which 
information is conveyed (hands, fingers, etc), are physically bound to various 
other body articulations in a way that must of necessity be depicted in a sign 
dictionary, but which may in fact be superfluous to gesturing intent. By per- 
forming PCA as an initial thresholding mechanism and subsequent ICA, it is 
thus possible to extract the minimal subset of gestural indicators.

When the temporality of these gestural sequences is additionally considered, 
the independence of the components has a crucial bearing on how intelligibly the 
modifications of meaning imposed by grammatical context are encompassed: fully 
independent components will contain only single context modifiers in individual 
channels (a fact exploited through our combination of HMM channel likelihoods 
in such a way that additional context information can be added as the technique 
develops).



4 Experimental Implementation and Findings 

In deriving the ICA transformation matrix for our initial feature data-set, we 
have thus firstly to concatenate all 232 word instances in the training data base 
(which would ideally consist of a large number of actual word sequences, such that
multiple word instances are represented at their correct relative probability of 
occurrence). Thus we obtain, for our limited case, a 21 x 3640 data matrix de- 
scribing the 3640 temporal states of the 21 feature ‘channels’ (there being an 
immediate redundancy of 9 features for the selected training vectors): see fig. 1 
for an indication of the format of the feature-vectors. ICA itself is carried out via
the negentropy maximisation algorithm developed at the Helsinki University of
Technology [6], which also performs a PCA-based assessment of feature redun- 
dancy. In this way, the resulting transformation matrix converts our original 21 
feature-channels to a series of 18 independent feature components. We shall seek 
to quantify the improvement in feature diversity implied by this transformation 
of temporal feature information in terms of the average Mahalanobis distance
between classifier outputs in the transformed and untransformed classes, respectively.
The actual attribution of class probabilities is carried out via HMM
modelling in the following manner: The post-ICA feature-set (figure 2) now contains
18 independent temporal feature sequences, with a continuous range of
values in the interval -2 : 2 (recall that the features were formerly quantised as
the binary digits 1 and 0, representing presence and absence respectively). It
is consequently necessary, in producing a meaningful comparison between the
classification abilities of the ICA-transformed features and those that have not
undergone the process, that we represent the former in a similarly quantised
fashion (that is, with an equivalent number of hidden and observable HMM
states). The number of observables in the untransformed feature-set being two
thus requires that we map the continuous interval (-2 : 2) to the discrete digits
0 and 1: we choose to do this in the most symmetric fashion by allocating
negative values to 0 and positive values to 1 (which would, of course, imply a
loss of information, such that the untransformed pattern vectors could not now
be recovered from the ICA features: in general this is not a conceptually serious
omission, being roughly the equivalent of retaining phase information at the
expense of amplitude information in a Fourier transformation, which has only
marginal consequences for signal recovery, unlike its converse).

Fig. 1. Untransformed gestural feature sequence for first 12 words in the database

Fig. 2. Transformed gestural feature sequence for first 12 words in the database with
grey scale values in the range -2 : 2 (note increase in information density over fig. 1)
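
The sign-based quantisation described above amounts to a one-line mapping, sketched below. The sample values are hypothetical, and the treatment of an exact zero (mapped to 0 here) is an assumption, since the text only specifies the handling of negative and positive values.

```python
def binarise(sequence):
    """Map each continuous ICA output to 1 if positive, otherwise 0."""
    return [1 if value > 0 else 0 for value in sequence]

# Hypothetical values for one transformed channel, within the interval -2 : 2.
channel = [-1.7, -0.2, 0.4, 1.9, -0.05]
print(binarise(channel))
```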

We have now to select the number of hidden states per HMM required to 
describe the range of behaviours implicit in the transformed and untransformed 
features. Empirical evidence suggests that four are sufficient. Thus for the 18 
temporal channels of the transformed data and the 21 channels of the untrans- 
formed data, we train a series of HMMs of 4 hidden and 2 observational states 
each, one for each of the classes in order to provide a comparable test environ- 
ment for the transformed and untransformed data. 

To allocate a class probability to a particular observed gesture sequence, we 
have therefore to utilise the multiple log-likelihood outputs of the trained class 
HMMs; respectively 18 or 21 separate HMMs per class. Uniquely for the ICA- 
transformed features it is possible to sum these channel likelihoods together to 
obtain a class probability. With respect to the untransformed features, however, 
we may still sum these likelihoods together to obtain an overall measure of class 
likelihood, but the lack of independence implies that this is no longer strictly a 
probability. (The only truly stochastic way to treat the untransformed channels, 
would be, as indicated earlier, to concatenate the channels together to represent
a single observable state, requiring a single HMM of 2^21 observable states
(assuming ergodicity) rather than the 21 HMMs of just 2 observable states em- 
ployed by our comparison statistic; clearly a computational impossibility). It is 
thus only the prior application of ICA that requires the use of decision fusion. 
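
A minimal sketch of this fusion step is given below: each channel is scored by its own discrete HMM using the forward algorithm in the log domain, and the per-channel log-likelihoods are summed for each class. The toy models use 2 hidden states for brevity (the experiments use 4), and all names, parameter values and class labels are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def safe_log(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else -math.inf

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under one HMM.

    start[i], trans[i][j] and emit[i][o] are ordinary probabilities.
    """
    n = len(start)
    alpha = [safe_log(start[i]) + safe_log(emit[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            log_sum_exp([alpha[i] + safe_log(trans[i][j]) for i in range(n)])
            + safe_log(emit[j][o])
            for j in range(n)
        ]
    return log_sum_exp(alpha)

def fuse_sum_rule(channels, class_models):
    """Sum per-channel log-likelihoods for each class and return the best class."""
    scores = {
        label: sum(
            forward_log_likelihood(obs, *hmm) for obs, hmm in zip(channels, hmms)
        )
        for label, hmms in class_models.items()
    }
    return max(scores, key=scores.get), scores

# Toy models given as (start, trans, emit); observables are the binary digits 0 and 1.
mostly_on = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]])
mostly_off = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]])
class_models = {
    "word_a": [mostly_on, mostly_off],
    "word_b": [mostly_off, mostly_on],
}
channels = [[1, 1, 1, 0], [0, 0, 1, 0]]  # one binary sequence per channel
best, scores = fuse_sum_rule(channels, class_models)
print(best, scores)
```

Each channel is scored independently of the others, so the cost grows linearly with the number of channels rather than exponentially with it.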

Having established a comparable performance measure for both the trans- 
formed and untransformed classes, we can, prior to computation of these quan- 
tities, obtain a preliminary quantification of the relative value of the two ap- 
proaches by considering the ensemble average inter-class Mahalanobis distance 
between every pair of class centroids in the total probability space deriving from 
the various HMM likelihoods. 

In our limited training set, for which we have a total of 3640 pattern vectors 
representing 115 distinct classes, the Mahalanobis distance is thus computed 
from an average of 31.7 pattern vectors ensemble-averaged over $\binom{115}{2} = 6555$
possible class pairs: the mean ensemble distances so derived are 4.7424 for the
transformed and 0.8583 for the untransformed features. Thus we have, aside 
from the other benefits achieved, obtained a roughly 5.5-fold improvement in the class
differentiability as a consequence of the ICA/PCA transformation. The result of 
Sum-Rule decision fusion for the transformed and untransformed data reflects 
this discrepancy at a more modest level, where we obtain classification rates of 
63.2% and 59.0%, respectively. Clearly a more complex decision fusion scheme 
such as a weighted summation or clustering approach could potentially improve 
upon this, and remains for future work. 
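
The arithmetic behind these figures can be checked directly from the numbers reported above. The short sketch below recomputes the number of class pairs, the average number of pattern vectors per class, the ratio of the two mean distances, and the gap between the two classification rates; the variable names are illustrative.

```python
import math

n_classes = 115
n_vectors = 3640

class_pairs = math.comb(n_classes, 2)      # unordered pairs of distinct classes
vectors_per_class = n_vectors / n_classes  # average pattern vectors per class
distance_ratio = 4.7424 / 0.8583           # transformed over untransformed
rate_gain = 63.2 - 59.0                    # difference in percentage points

print(class_pairs, round(vectors_per_class, 1),
      round(distance_ratio, 2), round(rate_gain, 1))
```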

One caveat that we should issue in relation to this preliminary performance 
assessment is that the word lengths, being highly variable and subject to ‘nest- 
ing’ difficulties, require that a range of windows of varying temporal length be 
simultaneously considered in order to return a probability statistic for all of the 
classes. Thus, we are still required to perform classical Viterbi sorting in order 
to perform real-time sign-language processing. In future publications we shall
consequently aim to provide a more detailed performance study of the implications
of PCA and ICA methods for gesture-recognition with this consideration
taken explicitly into account. We do not, however, expect that such contextual 
issues will significantly modify our findings with regard to class-separability: the 
existing argument for ICA in terms of the reduction in HMM computational 
requirement is, of course, completely unaffected. 

5 Conclusions 

We have instigated a programme to isolate the fundamental components of ges- 
tural intent by referring to existing sign dictionaries for the construction of an 
appropriate feature-set. To eliminate redundancies in this dictionary-based de- 
scription, we have further employed the techniques of ICA and PCA to derive a 
minimal set of independent gestural components. In doing so, we have necessi- 
tated the implementation of a decision fusion framework, and consequently been 
able to reduce the HMM computational requirement for word-recognition by 
several orders of magnitude, as well as to significantly enhance classifier diversity.
Additionally, we have allowed for further extensibility of the feature set, and per- 
mitted grammatical context to be straightforwardly included within the system 
design, an endeavour that will, we anticipate, reach fruition in a future context- 
sensitive implementation of the viseme-based gesture recognition system. 
Abstract. There are many examples of classification problems in the literature
where multiple classifier systems increase the performance over
single classifiers. Normally one of the following two approaches is used
to create a multiple classifier system: 1. Several classifiers are developed 
completely independent of each other and combined in a last step. 2. 
Several classifiers are created out of one prototype classifier by using so 
called classifier ensemble methods. In this paper a novel algorithm which 
combines both approaches is introduced. This new algorithm is experi- 
mentally evaluated in the context of hidden Markov model (HMM) based 
handwritten word recognizers and compared to previously introduced 
methods which also combine both approaches. 

Keywords: Handwriting Recognition; Hidden Markov Model (HMM); 
Multiple Classifier System; Ensemble Method. 



1 Introduction 

The field of off-line handwriting recognition has been a topic of intensive research 
for many years. First only the recognition of isolated handwritten characters was 
investigated [25], but later whole words [24] were addressed. Most of the systems 
reported in the literature until today consider constrained recognition problems 
based on vocabularies from specific domains, e.g. the recognition of handwrit- 
ten check amounts [13] or postal addresses [14]. Free handwriting recognition, 
without domain specific constraints and large vocabularies, was addressed only 
recently in a few papers [15, 21]. The recognition rate of such systems is still low, 
and there is a need to improve it. 

The combination of multiple classifiers was shown to be suitable for improving 
the recognition performance in difficult classification problems [18,28]. Also in 
handwriting recognition, classifier combination has been applied. Examples are 
given in [2, 19, 29]. Recently new ensemble creation methods have been proposed 
in the field of machine learning, which generate an ensemble of classifiers from a 
single classifier [4]. Given a single classifier, the base classifier, a set of classifiers 
can be generated by changing the training set [3], the input features [11], the 
input data by injecting randomness [6], or the parameters and the architecture 
of the classifier [22]. Another possibility is to change the classification task from
a multi-class to many two-class problems [5]. Examples of widely used methods
that change the training set are Bagging [3] and AdaBoost [7]. The random subspace
method [11] is a well-known approach based on changing the input features. A
summary of ensemble creation methods is provided in [4]. In the present paper
we focus on Bagging, the random subspace method and a version of AdaBoost for
multi-class problems, AdaBoost.M1 [7].

One common feature of the ensemble creation methods discussed above is 
the fact that they all start from a single classifier to derive an ensemble. In [8] a 
more general approach was proposed where we initially consider a set of classifier 
prototypes and separately apply an ensemble method to each of the prototypes. 
The final ensemble is then constructed by fusing all ensembles. The contribution 
of the present paper is twofold. First, the ensemble methods introduced in [8] 
are applied in conjunction with significantly improved HMM-based recognizers. 
Second, a new algorithm for ensemble generation is introduced and evaluated. 
This algorithm takes explicit advantage of the diversity of prototype classifiers. 

The rest of this paper is organized as follows. In Section 2, the method for 
classifier generation, which starts from a set of prototypes, rather than a single 
base classifier, is described. In Section 3 the new algorithm which takes explicit 
advantage of the diversity of the prototype classifiers is described. The prototype 
classifiers for handwriting recognition used in the experiments are presented in 
Section 4. Then, in Section 5, results of experiments are reported. Finally, some 
conclusions are drawn in Section 6. 

2 Creation of Ensembles from Sets 
of Prototypical Classifiers 

The method used in this paper for creating ensembles from sets of prototypical 
classifiers was introduced in [8] and is briefly described in this section. The
underlying idea is very simple. Rather than starting with a single classifier, as
is done, for example, in Bagging, AdaBoost and the random subspace method,
we initially consider a set of classifiers (called prototypes in the following) and use 
an ensemble method to generate an ensemble out of each individual prototype. 
Then we merge all classifiers of these ensembles to get a single ensemble. For 
further details see [8]. 
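
The idea can be sketched in a few lines, using Bagging as the ensemble method. This is only an illustration of the merge step under simplifying assumptions: the two prototype training functions are hypothetical one-dimensional threshold classifiers (not the HMM recognizers used in the experiments), and majority voting is assumed as the combination rule.

```python
import random

def bagging(train, data, n_members, rng):
    """Train n_members classifiers, each on a bootstrap sample of data."""
    members = []
    for _ in range(n_members):
        sample = [rng.choice(data) for _ in data]
        members.append(train(sample))
    return members

def ensemble_from_prototypes(prototypes, data, n_members, rng):
    """Apply the ensemble method to every prototype and merge the results."""
    merged = []
    for train in prototypes:
        merged.extend(bagging(train, data, n_members, rng))
    return merged

def majority_vote(ensemble, x):
    """Return the label chosen by most classifiers in the ensemble."""
    votes = [classify(x) for classify in ensemble]
    return max(set(votes), key=votes.count)

def train_mean_threshold(sample):
    """Hypothetical prototype 1: threshold at the mean of the sampled values."""
    cut = sum(x for x, _ in sample) / len(sample)
    return lambda x: int(x > cut)

def train_midrange_threshold(sample):
    """Hypothetical prototype 2: threshold halfway between min and max."""
    xs = [x for x, _ in sample]
    cut = (min(xs) + max(xs)) / 2.0
    return lambda x: int(x > cut)

# Toy data as (value, label) pairs; the toy prototypes ignore the labels.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
rng = random.Random(0)
ensemble = ensemble_from_prototypes(
    [train_mean_threshold, train_midrange_threshold], data, n_members=5, rng=rng
)
print(len(ensemble), majority_vote(ensemble, 0.95), majority_vote(ensemble, 0.05))
```

With two prototypes and five members each, the merged ensemble contains ten classifiers, all of which take part in the final vote.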

An issue that needs to be addressed when implementing the method sketched
above is the generation of the initial prototype classifiers $C_1, \ldots, C_n$. Sometimes
different classifiers for the same task may already exist. In the experiment of 
this paper we use HMM classifiers with different architecture and different input 
features as prototypes. 

3 Multi-probabilistic Boosting 

The new algorithm is based on the simple probabilistic boosting (SPB) method 
introduced in [9]. The SPB algorithm works with a distribution d of weights 
over the training set, similarly to AdaBoost [7]. Here we interpret the weight
d(x) of an element x of the original training set T as the probability of being
selected when sampling elements for the training set of the next classifier of
the ensemble. The main idea of the SPB algorithm is to set the weight d(x) of
a training element x proportional to the probability e(x) that the ensemble of
classifiers will misclassify elements similar to element x.

Similarly to AdaBoost, the weights of “hard” elements, i.e. elements which 
are likely to be misclassified, are set to a high value. It was decided to make 
the weights linearly dependent on the misclassification probability, because this 
approach is simple and the optimal function $o : e(x) \mapsto d(x)$ is unknown anyway.
The question remains how to calculate e(x) from the results of the classifiers
$C_1, \ldots, C_m$ that were already created. Four different functions were considered
in [9]. In this paper only one function, E(x), will be used^, as it produced the
best results in previous experiments. The error function E(x) is defined in the 
following equation: 

$$E(x) = \begin{cases} 0 & \text{if the voting result of the ensemble for } x \text{ is correct} \\ \ldots & \text{otherwise} \end{cases} \qquad (1)$$

In this equation k{x) is the number of correct classifiers for pattern x and m is 
the total number of classifiers in the ensemble. 
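
The sampling step of SPB can be sketched as follows. The error estimates are taken as given inputs rather than computed from E(x), the uniform fallback for the case in which every estimate is zero is an assumption, and the function names and values are hypothetical.

```python
import random

def spb_distribution(errors):
    """Normalise error estimates e(x) into selection probabilities d(x)."""
    total = sum(errors)
    if total == 0:
        # No element is considered hard: fall back to uniform sampling (assumption).
        return [1.0 / len(errors)] * len(errors)
    return [e / total for e in errors]

def sample_training_set(training_set, weights, rng):
    """Draw a new training set of the same size, with replacement."""
    return rng.choices(training_set, weights=weights, k=len(training_set))

# Hypothetical error estimates for five training elements.
training_set = ["x1", "x2", "x3", "x4", "x5"]
errors = [0.0, 0.5, 0.0, 0.25, 0.25]
d = spb_distribution(errors)
next_set = sample_training_set(training_set, d, random.Random(1))
print(d, next_set)
```

Elements with a zero error estimate receive zero weight and therefore never appear in the next training set, which concentrates the next classifier on the hard elements.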

The novel multi probabilistic boosting (MPB)^ algorithm to be introduced in 
this paper, is an extension of SPB for the case of several prototypes. In contrast 
to the SPB algorithm, and also to AdaBoost, we have in MPB a distribution d 
of weights over the training set for each prototype. The algorithm produces one 
classifier for each prototype per step where the distribution corresponding to the 
actual prototype is used. The weight of the distribution of the i-th prototype
is set according to the following equations

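\Bigl(\sum_i d_i(x)\Bigr) \propto e(x) \qquad (2)

\frac{d_i(x)}{\sum_j d_j(x)} = \frac{1 - e_i(x)}{\sum_j \bigl(1 - e_j(x)\bigr)} \qquad (3)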

In the equations, e(x) is the probability that the ensemble consisting of all classifiers
produced in the previous steps misclassifies x, and e_i(x) is the probability
of a wrong result for pattern x being produced by the ensemble consisting only
of the classifiers produced from the i-th prototype. If Σ_j (1 − e_j(x)) = 0, i.e.
the error probability is 1 for all prototypes, then all d_i(x) are set to the same
value (i.e. ∀ i, j: d_i(x) = d_j(x)). In the experiments of this paper the error
function E(x) described above was always used.

In this algorithm the prototypes which are likely to correctly recognize ele- 
ment X receive higher weights than prototypes whose classifiers often misclassi- 
fied the pattern. The reason behind this kind of weight assignment is that we 

¹ The function E(x) was denoted by e2(x) in [9].

² The first word “multi” refers to the fact that MPB works with multiple prototypes.
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lexicon words character models 




Fig. 1. Concatenation of character models yields the word models 



want each prototype to “focus” on those elements for which its chances of suc- 
cessful recognition are high. We also notice that the weight is proportional to 
the error probability of the ensemble consisting of all classifiers produced in the 
previous steps. This means that the weight is especially high for patterns which
are likely to be misclassified by the whole ensemble and which the current
prototype is likely to classify correctly.
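The weight assignment described above can be sketched in Python for a single training element x. The sketch assumes that the share of prototype i is proportional to its success probability 1 − e_i(x), with the shares summing to e(x); the function name is hypothetical and the returned weights are not yet normalised over the training set.

```python
def mpb_weights(e_all, e_proto):
    """Per-prototype weights d_i(x) for one training element x.

    e_all   -- e(x), error probability of the ensemble of all classifiers
    e_proto -- [e_1(x), ..., e_n(x)], error probability of the sub-ensemble
               built from each prototype
    """
    success = [1.0 - e for e in e_proto]
    total = sum(success)
    n = len(e_proto)
    if total == 0.0:
        # Every prototype always fails on x: all d_i(x) get the same value.
        return [e_all / n] * n
    # Assumed rule: split e(x) in proportion to each prototype's success.
    return [e_all * s / total for s in success]
```

For example, with e(x) = 0.4 and prototype error probabilities 0.2 and 0.6, the first prototype receives twice the weight of the second, so it is the one that "focuses" on x.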

4 Handwritten Text Recognizer 

In Section 5 two sets of experiments are described. In the first set of experiments 
two HMM-based handwritten word recognizers are used. These two recognizers, 
which will be called C1 and C2 in the following, are similar to the one described
in [21]. We assume that each handwritten word input to the recognizers has been 
normalized with respect to slant, skew, baseline location and height (for details 
of the normalization procedures see [21]). A sliding window of one pixel width is 
moved from left to right over the word and nine geometric features are extracted 
at each position of the window. Thus an input word is converted into a sequence 
of feature vectors in a 9-dimensional feature space. After the extraction of a 
feature vector the window is shifted by one pixel, i.e. the number of extracted 
feature vectors is the same as the width of the word in pixels. The geometric 
features used in the system include the fraction of black pixels in the window, 
the center of gravity, and the second order moment. These features characterize 
the window from the global point of view. The other features give additional 
information. They represent the position of the upper- and lowermost pixel,
the contour direction at the position of the upper- and lowermost pixel³, the
number of black-to-white transitions in the window, and the fraction of black 
pixels between the upper- and lowermost black pixel. In [21] a more detailed 
description of the feature extraction procedures can be found. 
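As an illustration of the one-pixel sliding window, the following Python sketch computes a few of the listed features for a binarised word image. It is not the feature extractor of [21]: the dictionary keys, the convention that row 0 is the top of the image, and the handling of empty columns are assumptions made for this example, and the contour direction and second order moment are omitted.

```python
def column_features(column):
    """A few geometric features of one pixel column (1 = black, 0 = white)."""
    height = len(column)
    rows = [r for r, pixel in enumerate(column) if pixel == 1]
    # Black-to-white changes when scanning the column from top to bottom.
    transitions = sum(1 for a, b in zip(column, column[1:]) if a == 1 and b == 0)
    if not rows:
        # Assumption: positions are undefined for a column without black pixels.
        return {"fraction_black": 0.0, "center_of_gravity": None,
                "upper": None, "lower": None, "transitions": 0}
    return {
        "fraction_black": len(rows) / height,
        "center_of_gravity": sum(rows) / len(rows),
        "upper": rows[0],
        "lower": rows[-1],
        "transitions": transitions,
    }


def extract_features(image):
    """One feature record per pixel column, scanning from left to right.

    image is a list of rows, each a list of 0/1 pixels.
    """
    width = len(image[0])
    return [column_features([row[x] for row in image]) for x in range(width)]
```

The length of the returned sequence equals the width of the word in pixels, as stated in the text.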

³ To compute the contour direction, the windows to the left and to the right of the
current window are used.






Simon Gunter and Horst Bunke 



For each uppercase and lowercase character, an HMM is built. For all HMMs
the linear topology is used, i.e. there are only two transitions per state, one to
itself and one to the next state. To model entire words, the character models are
concatenated with each other. Thus a recognition network is obtained (see Fig.
1). This network exactly represents the set of words included in the underlying
dictionary. Note that the network does not include any contextual knowledge on
the character level, i.e., the model of a character is independent of its left and
right neighbor. There is exactly one model for each word from the underlying
dictionary. This approach makes it possible to share training data across different
words. That is, each word in the training set containing character x contributes
to the training of the model of x. Thus the words in the training set are more
intensively utilized than in the case where an individual model is built for each
word as a whole, and characters are not shared across different models. One
important advantage of using HMMs on the word level is that the segmentation
of the words into characters is done automatically by the Viterbi recognition
algorithm [23].
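The linear topology and the concatenation of character models can be sketched as follows. This is only a structural illustration in Python, not the HTK implementation: the function names, the fixed self-transition probability and the representation of states as (character, index) pairs are assumptions.

```python
def linear_hmm(num_states, stay_prob=0.5):
    """Transition matrix of a linear HMM.

    Row i holds the transition probabilities out of state i. Each state has
    exactly two transitions: to itself and to the next state. The extra last
    column stands for leaving the model.
    """
    matrix = [[0.0] * (num_states + 1) for _ in range(num_states)]
    for i in range(num_states):
        matrix[i][i] = stay_prob
        matrix[i][i + 1] = 1.0 - stay_prob
    return matrix


def word_model(word, states_per_char):
    """Concatenate character models into the state sequence of a word model."""
    states = []
    for ch in word:
        states.extend((ch, k) for k in range(states_per_char[ch]))
    return states
```

Because every word model is assembled from the same character models, each training word containing a given character contributes to that character's parameters.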

The feature distributions in each state of an HMM are modeled by single
Gaussians, and four iterations of the Baum-Welch algorithm [23] are used for
the training of the classifiers. The two classifiers, C1 and C2, differ in the way
the number of states is determined for each individual character. For C1 the
Quantile method [31] was used, and for C2 the Bakis method [31].

In the second set of experiments described in Section 5 two more sophisticated
classifiers were used. The first classifier, C3, is an optimized version of classifier
C1. For classifier C3 the distribution of the features in each state of an HMM
is modeled by a Gaussian mixture instead of a single Gaussian. The training
method of the classifier, which involves the determination of the number of
Gaussians in each state and the number of training iterations, was optimized
on a validation set, using a strategy described in [10]. The other classifier, C4,
is a modified version of the classifier presented in [26,27]. Classifier C4 uses a
sliding window for feature extraction, where the window width is 16. For this
classifier, too, the window is shifted by only one pixel after the extraction of a
feature vector. The window is partitioned into 16 cells arranged in a 4 x 4 grid. The
average grey value of the pixels of each cell is used as a feature. A Karhunen-Loève
transformation [16] is then applied to the feature vectors and only the
first 14 components of the transformed feature vectors are used. Classifier C4
also uses a training method optimized by a strategy presented in [10], and models
the distribution of the features by Gaussian mixtures. In addition, the number
of states in each HMM is optimized by the Quantile method introduced in [31].
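The grid features of classifier C4 can be illustrated with a short Python sketch. It covers only the cell averaging; the subsequent Karhunen-Loève transformation is not shown. The function name and the requirement that the window height and width are at least 4 pixels are assumptions of this example.

```python
def grid_features(window, grid=4):
    """Average grey value of each cell of a grid x grid partition.

    window is a list of pixel rows (grey values); its height and width are
    assumed to be at least `grid`, so that no cell is empty.
    """
    height, width = len(window), len(window[0])
    features = []
    for gy in range(grid):
        r0, r1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            c0, c1 = gx * width // grid, (gx + 1) * width // grid
            cell = [window[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(cell) / len(cell))
    return features
```

For a 16-pixel-wide window this yields 16 values per window position, which would then be reduced to 14 components by the transformation.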

The implementation of all systems is based on the Hidden Markov Model 
Toolkit (HTK), which was originally developed for speech recognition [30]. This 
software tool employs the Baum-Welch algorithm for training and the Viterbi
algorithm for recognition [23]. The output of each HMM classifier is the word 
with the highest rank among all word models together with its score value. 
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5 Experiments 

For isolated character and digit recognition, a number of commonly used data- 
bases exist. However, for the task considered in this paper, there exists only one 
suitable database to the knowledge of the authors, holding a sufficiently large 
number of words produced by different writers [20]. Consequently this database
was used in the experiments. 

Two sets of experiments were done. In the first set classifiers C1 and C2 were
used and the number of classifiers per ensemble was fixed to 10. In the second
set of experiments classifiers C3 and C4 were used. For this set of experiments
the optimal number of classifiers was determined in a separate experiment. 

To combine the individual classifiers of the ensembles, the following combi- 
nation schemes were applied: 

1. Voting (voting): Only the top choice of each classifier is considered. The
word class that is most often on the first rank is the output of the combined
classifier. Ties are broken by means of the maximum rule, which is only
applied to the competing word classes. The maximum rule decides for the
word class with the highest score among all word classes and all classifiers.

2. Weighted voting (perf. v.): Here we consider again the top class of each
classifier. In contrast with regular voting, a weight is assigned to each classifier.
The weight is equal to the classifier’s performance (i.e. recognition rate) on
the training set. The output of the combined classifier is the word class that
receives the largest sum of weights.

3. GA weighted voting (ga v.): This combination scheme is similar to weighted
voting, but the optimal weights are calculated by a genetic algorithm based
on the results of the classifiers achieved on the training set.
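The voting schemes listed above can be sketched with one Python function. Plain voting corresponds to equal weights, and the weighted variants differ only in where the weights come from. The function name is hypothetical, and applying the maximum rule to ties in weighted voting is an assumption, since the text describes tie-breaking only for plain voting.

```python
from collections import defaultdict


def vote(outputs, weights=None):
    """Combine the top choices of the classifiers of an ensemble.

    outputs -- one (word, score) pair per classifier
    weights -- one weight per classifier; None means plain voting
    """
    if weights is None:
        weights = [1.0] * len(outputs)
    totals = defaultdict(float)
    for (word, _score), w in zip(outputs, weights):
        totals[word] += w
    best = max(totals.values())
    tied = {word for word, total in totals.items() if total == best}
    # Maximum rule restricted to the competing classes: the tied class that
    # received the highest single score wins.
    return max((o for o in outputs if o[0] in tied), key=lambda o: o[1])[0]
```

For weighted voting the weights would be the recognition rates on the training set; for GA weighted voting they would be supplied by the genetic algorithm, which is not shown here.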

In the first set of experiments a data set of 10,927 words with a vocabulary of 
size 2,296 was used. That is, a classification problem with 2,296 different classes 
was considered. The total number of writers who contributed to this set is 81. 
Prototype classifiers C1 and C2, as described in Section 4, were used.

A training set containing 9,861 words and a test set containing 1,066 words 
were chosen in such a way that none of the writers of the test set were represented 
in the training set. So the experiments are writer independent. The recognition 
rate in this experiment was 70.92 % for prototype C1 and 70.71 % for prototype
C2. The results of the first set of experiments are shown in Table 1. The
ensemble method is indicated in the column algorithm. The entries in column C
denote the classifiers used in the experiment. If there is only one classifier, then
the normal ensemble method was applied. If the entry contains both classifiers,
then the algorithm described in Section 2 was used. In this case five classifiers
were generated from each of the prototypes C1 and C2. The number of features for the
random subspace method was set to six.

First we focus on rows 1-9 in Table 1, i.e. all methods but MPB are considered.
All ensemble methods using both prototypes outperform the corresponding
algorithms using only one prototype for any of the considered combination rules 
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Table 1. Results of the ensemble methods. The recognition rate of the prototype 
classifiers C1 and C2 is 70.92 % and 70.71 %, respectively.

algorithm          C        voting     perf. v.   ga v.
Bagging            C1       71.11 %    71.2 %     70.82 %
Bagging            C2       70.83 %    71.01 %    70.92 %
AdaBoost           C1       72.23 %    72.33 %    72.23 %
AdaBoost           C2       71.39 %    71.76 %    71.67 %
random subspace    C1       71.29 %    71.01 %    71.29 %
random subspace    C2       70.26 %    70.45 %    70.08 %
Bagging            C1,C2    71.29 %    71.67 %    71.67 %
AdaBoost           C1,C2    72.8 %     72.51 %    72.7 %
random subspace    C1,C2    71.86 %    71.95 %    71.39 %
MPB                C1,C2    73.36 %    73.08 %    73.17 %



(see rows 1-9). This means that in 6 out of 6 cases the algorithm described in
Section 2 produced results that are superior to the corresponding classic ensemble
methods. When applying the sign test [12], the finding that the algorithm
described in Section 2 is better is statistically significant at a significance level
of 2%. This shows that the algorithm described in Section 2 takes advantage of
the diversity of the prototypes.
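The significance level can be checked by hand: under the null hypothesis each of the six comparisons is a fair coin flip. The Python sketch below assumes a one-sided sign test, since the text does not state which variant was used; the function name is hypothetical.

```python
from math import comb


def sign_test_p(wins, trials):
    """One-sided sign test.

    Probability of at least `wins` successes in `trials` fair coin flips.
    """
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials
```

With 6 wins in 6 comparisons this gives 1/64 = 0.015625, which is below the 2% level. The two-sided value would be twice as large.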

AdaBoost in conjunction with C1, C2 (abbreviated as AdaBoost(C1, C2) in
the following) produced the best result out of rows 1 to 9. Considering also
the last row, we notice that the novel algorithm, MPB, was better than
AdaBoost(C1, C2) for any of the combination rules. On average, MPB achieved a
recognition rate about 0.5 percentage points higher than AdaBoost(C1, C2). This
shows that the MPB algorithm has the potential of improved performance over the
algorithm described in Section 2.

In the second set of experiments a training set of 18,920 words and a test 
set of 3,264 words were used. The vocabulary of the experiment contains 3,997 
words, i.e. a classification problem with 3,997 different classes is considered. The 
set of writers of the training set and the set of writers of the test set are disjoint. 
So the experiments are again writer independent. The total number of writers 
who contributed to this set is 153. Prototype classifiers C3 and C4, as described 
in Section 4, are used. The recognition rates of these classifiers are 80.36 % and 
71.57 %, respectively. 

As classifier C4 uses transformed (and reduced) feature vectors, the
random subspace method is not suitable for this classifier. However,
Bagging and AdaBoost are applicable.

For this set of experiments the optimal number of classifiers was determined 
in a separate experiment. The optimal number of classifiers was found to be 21 
for Bagging and 14 for AdaBoost. For the algorithm described in Section 2 we 
produced ensembles of similar size as for the corresponding classical method. 
In addition the same number of classifiers as AdaBoost, 14, was used for MPB. 
There are two reasons for doing this. First, MPB and AdaBoost are quite similar
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Table 2. Results of the second set of experiments. The recognition rate of prototype 
classifiers C3 and C4 is 80.36 % and 71.57 %, respectively.

algorithm          C        voting     perf. v.   ga v.
Bagging            C3       81.1 %     80.91 %    81.13 %
Bagging            C4       73.07 %    73.04 %    72.95 %
AdaBoost           C3       82.02 %    81.89 %    81.92 %
AdaBoost           C4       74.3 %     73.9 %     73.77 %
random subspace    C3       80.73 %    80.58 %    80.61 %
Bagging            C3,C4    83 %       83.64 %    83.21 %
AdaBoost           C3,C4    83.76 %    83.79 %    84.07 %
MPB                C3,C4    83.79 %    84.16 %    84.16 %



because they are both boosting algorithms. Second, by using the same number 
of classifiers, a more objective comparison is possible. 

The results of the second set of experiments are shown in Table 2, where the
same notation as in Table 1 is used. Again, we notice that all ensemble methods
using both prototypes outperform the corresponding algorithms using only one
prototype for any combination rule. To compare the ensemble methods more
thoroughly, the results of the methods using several prototypes were compared
to the results of the algorithms using prototype C3 with respect to statistical
significance. It was observed that the superiority of the ensemble methods using
both prototypes was statistically significant for all combination schemes at a
significance level of 0.1 %. MPB again produced the best results. Although the
superiority of MPB over AdaBoost(C3, C4) is not statistically significant, it is an
indication that MPB is capable of outperforming the AdaBoost version which
uses several prototypes.

Please note that all algorithms using both prototypes produced good results 
despite the fact that half of the classifiers were produced from prototype C4, 
which has a much lower performance than prototype C3. It seems that the in- 
crease of diversity by adding classifiers produced out of prototype C4 had a larger 
impact on the performance than the rather low individual performance of these 
additional classifiers. 

6 Conclusions 

In this paper, the generation of ensembles of classifiers from a set of prototype 
classifiers was studied. A new ensemble method, multi probabilistic boosting 
(MPB), has been proposed. The new ensemble method was experimentally eval- 
uated together with previously introduced ensemble methods using several pro- 
totypes in complex handwritten word recognition tasks. The ensemble methods 
using several prototypes were also compared to classical ensemble methods. 

The ensemble methods using two prototype classifiers were found to be su- 
perior to the ensemble methods using only one of the two prototype classifiers. 
The performance could be further improved by using the MPB method. These 
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findings were consistent for two independent data sets and for all of the three 
considered combination rules. 

In future research we will focus on the use of more than two prototype classifiers.
For example, all four prototype classifiers C1, C2, C3 and C4 described in this
paper could be applied together.
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Abstract. A serial multi-stage classification system for facing the problem of 
intrusion detection in computer networks is proposed. The whole decision process
is organized into successive stages, each one using a set of features tailored
for recognizing a specific attack category. All the stages employ suitable criteria
for estimating the reliability of the performed classification, so that, in case
of uncertainty, information related to a possible attack is only logged for further
processing, without raising an alert for the system manager. This makes it
possible to reduce the number of false alarms.

The proposed multi-stage intrusion detection system has been tested on two dif- 
ferent services (http and ftp) of a standard database used for benchmarking in- 
trusion detection systems. The experimental analysis highlights the effective- 
ness of the approach: the proposed system behaves significantly better than 
other multi-expert systems performing classification in a single stage. 



1 Introduction 

The increasing number of different services nowadays offered through the Internet
creates a strong demand for computer network security techniques that
protect Internet providers and/or commercial sites from malicious attacks
(intrusions). A variety of approaches to the network intrusion detection problem
has been proposed so far [1]. Nevertheless, Intrusion Detection
Systems (IDSs) can basically be ascribed to two different categories. The first one,
which exploits signatures of known attacks for detecting when an attack occurs, is
known as misuse, or signature, detection. IDSs that fall in this category are based
on a model of all the possible misuses of the network resources. This completeness
requirement is in fact their major limitation [2].



This work has been partially supported by the Ministero dell'Istruzione, dell'Università e della Ricerca
(MIUR) in the framework of the FIRB Project “Middleware for advanced services over large-scale,
wired-wireless distributed systems (WEB-MINDS)”.

F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 324-333, 2004.

© Springer-Verlag Berlin Heidelberg 2004




Network Intrusion Detection by a Multi-stage Classification System



A dual approach tries to characterize the normal usage of the resources under 
monitoring. An intrusion is then suspected when a significant difference from the 
resource’s normal usage is revealed. IDSs following this approach, known as anom- 
aly detection, seem to be more promising because of their potential ability to detect 
unknown intrusions. However, in this case the major challenge is the need to acquire
a model of the normal use general enough to allow authorized users to work
without raising false alarms, but specific enough to recognize unauthorized usages
[3,4].

Different attack types can occur in a real network. Kendall [5] proposed a taxon- 
omy of attacks, grouping them into four major categories: Probes, Denial of service 
(DoS), Remote to local (R2L) and User to root (U2R). The first category is made up 
of attacks that test a potential target to collect information about a possible intrusion. 
Therefore, they are usually harmless, unless a vulnerability is discovered and later 
exploited. DoS attacks prevent normal operation, causing the target host or a server to 
crash, or blocking network traffic; they, however, do not violate the target host. On 
the contrary, the last two categories group together attacks that permit the attacker to 
compromise the target host. In particular, in R2L attacks, an unauthorized user is able 
to bypass normal authentication and to execute commands on the target host, while in 
U2R attacks, a user with login access is able to bypass normal authentication to gain 
the privileges of another user, typically the root user. 

With this taxonomy in mind, the network intrusion detection problem can be easily
formulated as a typical pattern recognition problem [6]: given information about
network connections between pairs of hosts, the task is to assign each connection to
one out of five classes, which represent normal traffic conditions or one of the four
different attack categories described above. Here the term “connection” refers to a
sequence of data packets related to a particular service, such as a file transfer via the
ftp protocol. Since an IDS must detect connections related to malicious activities, each
network connection can be viewed as a “pattern” to be classified.

This formulation implies the use of an IDS based on a misuse detection approach. 
The main advantage of the pattern recognition approach is the generalization capability
exhibited by pattern recognition systems. They are able to detect some novel attacks
without the need for a complete description of all the possible attack signatures,
thus overcoming one of the main drawbacks of the misuse detection approach. In
[7] the feasibility of the pattern recognition approach for the intrusion detection
problem is addressed. Different pattern recognition systems have been proposed in the
recent past for realizing an IDS, mainly based on neural network architectures [3,8,9].
In order to improve the performance, approaches based on multi-expert architectures
have also been proposed [6,10].

However, it is worth noting that one of the main drawbacks occurring
when using pattern recognition techniques in real environments is the high false
alarm rate they often produce [6] (this drawback, indeed, is shared by a large number
of commercial IDSs). Moreover, the classification is usually performed in a feature
space made up of all the features needed to detect the considered attack classes. Since
the distribution of the classes is very unbalanced, it is necessary to employ a quite
large set of features having different semantic meaning to reliably distinguish be- 
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tween patterns coming from distinct classes. This is not advisable, because the exces- 
sive size of the feature vectors could make more difficult the construction of reliable 
classifiers. 

Starting from these considerations, in this paper we propose a multi-stage classification
system for network intrusion detection. Each stage performs a binary classification,
distinguishing between normal connections and a single type of attack, and
utilizes features especially tailored for performing classification between these two
classes. In order to reduce the number of missed detections, the proposed multi-stage
system declares a connection as normal traffic only if none of the stages detects
an attack. Furthermore, in order to keep the number of false alarms low, some
criteria have been established for evaluating the reliability of each classification act
directly from the output of the classifiers. The reliability value is used to reject
patterns that are unreliably attributed to an attack class. Reject, in this case, implies that
the data about a ‘rejected’ connection are only logged for further processing, without
raising an alert for the system manager. This could be done by using, for example,
the proposed multi-stage system as a detection engine module of a public-domain
IDS like Snort [11].

The organization of the paper is as follows: in Section 2 the proposed architecture 
is described, while in Section 3 several tests on a standard database used for bench- 
marking IDS are reported, together with a comparison of the proposed system with 
some parallel multi-expert systems. Finally, some conclusions are drawn. 



2 The Proposed Approach 

As anticipated in the introduction, if the detection of an intrusion is performed in a
feature space made up of all the features needed to detect all the considered attack
classes, the excessive size of the feature vectors could make it difficult to construct
a reliable classifier. Moreover, the false alarm rate should be kept low in
order to make pattern recognition techniques appealing also for a system manager.

In order to address the first problem, the proposed architecture is made up of a
cascade of stages; each stage considers a different set of features and is tailored for
distinguishing only between normal traffic and one attack class. Therefore each stage
behaves as a binary classifier on a (possibly) reduced set of features, which makes an
improvement of the overall system performance more likely. As regards the flow of
the decision process, if a stage recognizes a pattern as normal traffic, the pattern is
forwarded to the next stage for further classification. In this way, a pattern is
recognized as normal traffic only if all the stages attribute it to the normal traffic
class. This choice should keep the number of undetected attacks low.

As regards the dual need of decreasing the false alarm rate, it must be noted that
each stage is made up of an expert, devoted to the classification of an input pattern,
and of a decider. The latter, on the basis of the output vector provided by the
corresponding expert, estimates the reliability of the classification decision, thus
isolating all the patterns that can be reliably considered as attacks. If a stage is not
able to reliably assign the sample to an attack class, the pattern is rejected. On the contrary, if an
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expert recognizes a pattern as normal traffic, it is always forwarded to the successive 
stage for further classification, independently of reliability. 

An overview of the decisional flow of the proposed system is given in Fig. 1. Note
that the reliability parameters, whose values range from 0 to 1, are indicated with ψ,
and the reliability thresholds (formally defined hereafter) with a. These symbols have
a subscript denoting the stage they refer to.




Fig. 1. The implemented decisional flow as a function of the classification decisions at the 
different stages and of the corresponding reliability evaluations. 



The classification process starts by presenting the input pattern to the first stage,
which is devoted to discriminating between normal traffic and attacks belonging to the
DoS or Probe classes. These two attack classes are considered together since their
features should be quite similar. If the expert of this stage reliably recognizes the
pattern as an attack, a second stage is activated for discriminating between the DoS and
Probe classes. However, if the reliability of the classification at either of these two stages is below a suitable threshold,
the input pattern is rejected, i.e. the data related to the connection are logged for fur- 
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ther processing and no alert is raised. The method used for fixing the thresholds for 
each stage will be illustrated in the next subsection. 

On the other hand, if the first stage does not attribute the input connection to the
attack class, the stage devoted to discriminating between normal connections and R2L
attacks is activated. If the expert of this stage attributes the input pattern to the R2L
class with high reliability, an alert is generated and the classification process stops; on
the contrary, if the reliability is below a suitably fixed threshold, the pattern is rejected
and the data relative to the input connection are logged. Once again, the process
continues if the expert of this stage classifies the input pattern as belonging to
normal traffic, regardless of the associated reliability. In this case, a third (and last)
stage is activated. It must discriminate between the normal connection and U2R classes,
adopting the same decision rule as the previous stage. In summary, it is worth noting
that while the attribution to an attack class can be made by a single expert, a pattern
can be assigned to the normal traffic class only after taking into account the results
of three decision stages.
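The decisional flow described above can be summarised in a short Python sketch. The experts are represented by hypothetical callables that return a label together with a reliability value; the function name, the return values and this interface are assumptions made for illustration and do not reproduce the authors' implementation.

```python
def classify_connection(pattern, stages, subtype_stage):
    """Walk one connection through the cascade of stages.

    stages        -- list of (attack_name, expert, threshold); each expert
                     returns (label, reliability), label "attack" or "normal"
    subtype_stage -- (expert, threshold) separating DoS from Probe after the
                     first stage; its expert returns (class_name, reliability)
    Returns ("alert", class_name), ("reject", None) or ("normal", None).
    """
    for index, (attack_name, expert, threshold) in enumerate(stages):
        label, reliability = expert(pattern)
        if label == "normal":
            continue  # forwarded to the next stage regardless of reliability
        if reliability < threshold:
            return ("reject", None)  # data are logged, no alert is raised
        if index == 0:
            # The first stage only says "DoS or Probe"; a further expert decides.
            sub_expert, sub_threshold = subtype_stage
            attack_class, sub_reliability = sub_expert(pattern)
            if sub_reliability < sub_threshold:
                return ("reject", None)
            return ("alert", attack_class)
        return ("alert", attack_name)
    return ("normal", None)  # every stage attributed the pattern to normal traffic
```

The asymmetry is visible in the control flow: a single reliable attack decision ends the process, whereas the normal label is returned only after the loop has passed through all stages.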

The choice of considering DoS and Probe attacks together in the first stage is justified
because connections related to these types of attacks are characterized by a behavior
that is typically quite different from the one exhibited by normal connections.
Thus, it is simpler to detect them at the first step. On the contrary, U2R and R2L
attacks are more difficult to recognize and are even more dangerous, since their aim is
to violate the target host. Obviously, the order in which the different stages are activated
influences the overall system performance. In the next Section, tests will be made to
confirm that the chosen activation sequence guarantees the best system performance.



2.1 Reliability Thresholds 

The low reliability of a classification can be traced back to one of the following
situations: a) the considered sample is significantly different from those present in the
training set; b) the point which represents the sample considered in the feature space
lies where the regions pertaining to different classes overlap. To distinguish between
classifications which are unreliable because a sample is of type a or b, we define two
reliability parameters, ψ_a and ψ_b, whose values vary between 0 (completely unreliable)
and 1 (very reliable). Each parameter is a function of the classifier output vector;
in [12] the definitions of the reliability parameters for some popular neural
architectures are given.

A parameter ψ providing an inclusive measure of the reliability of a classification
can be computed by combining the values of ψ_a and ψ_b. The form chosen for ψ is:
ψ = min{ψ_a, ψ_b}. This is certainly a conservative choice because it implies that a
low value for just one of the parameters is sufficient to consider the whole
classification unreliable. However, it is consistent with the kind of classification system
considered, which is aimed at achieving the highest reliability.

Once the reliability of each classification has been evaluated, the optimal values of 
the thresholds can be determined by using the method described in [13]. It is assumed 
that an effectiveness function P is defined which, taking into account the require- 
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ments of the particular application, evaluates the quality of the classification in terms 
of correct recognition, misclassification and rejection rates. Under this assumption the 
optimal reject threshold value a, determining the best trade-off between reject rate 
and misclassification rate, is the one for which the function P reaches its absolute 
maximum. 

The requirements of the particular application domain are specified by attributing
costs to misclassifications, rejects and correct classifications. As in [13], in our case
the cost of an error depends on the actual class. So a cost matrix C
has to be defined, whose generic element indicates the cost of misclassifying a pattern
belonging to the i-th class by attributing it to the j-th class. We can assume that the
gain for correct classification and the cost of a reject are not dependent on the class of
the pattern. On the contrary, we can expect that the costs of a misclassification are
quite different depending on the actual and the guessed class.

Finally, it is worth noting that misclassification costs are typically higher than the
reject costs. This is true also in our domain, where a reject implies that the data
relative to the ‘rejected’ connection are logged by the IDS for off-line processing and
so there is still the possibility of detecting an intrusion.
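The selection of a reject threshold can be illustrated with a minimal Python sketch. The exact effectiveness function of [13] is not given in the text, so the form used here is an assumption: a gain for every accepted correct classification, a cost for every accepted error, and a smaller cost for every reject. It also uses a single error cost instead of the class-dependent cost matrix C, and all names and default values are hypothetical.

```python
def effectiveness(samples, threshold, gain=1.0, error_cost=5.0, reject_cost=1.0):
    """Assumed effectiveness P for one threshold value.

    samples holds (psi_a, psi_b, is_correct) triples; the reliability of a
    classification is the minimum of the two parameters.
    """
    score = 0.0
    for psi_a, psi_b, is_correct in samples:
        psi = min(psi_a, psi_b)
        if psi < threshold:
            score -= reject_cost      # classification rejected
        elif is_correct:
            score += gain             # accepted and correct
        else:
            score -= error_cost       # accepted but wrong
    return score / len(samples)


def best_threshold(samples, **costs):
    """Scan the observed reliability values (and 0) for the maximiser of P."""
    candidates = sorted({0.0} | {min(a, b) for a, b, _ in samples})
    return max(candidates, key=lambda t: effectiveness(samples, t, **costs))
```

Because an accepted error costs more than a reject, the selected threshold tends to move upwards until the unreliable errors are rejected, at the price of rejecting some correct classifications as well.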



3 Experimental Results 

The proposed system has been tested on a subset of the database created by DARPA
in the framework of the 1998 Intrusion Detection Evaluation Program. It is made up
of a large number of network connections related to normal and malicious traffic.
This database was pre-processed by Columbia University, giving rise to a feature
vector of 41 elements for each connection, according to the set of features defined in
[14] and tailored for the intrusion detection problem. In the database each connection
is labelled as belonging to one of the five classes described in the previous Sections:
normal traffic, Probe, DoS, U2R and R2L attacks. It is worth noting that each attack
class is made up of different attack variants, each one exploiting different
vulnerabilities of a computer network.

Our results have been systematically compared with those obtained from some
single classifiers operating in a single stage and from some parallel Multi-Expert
Systems (MES). As single experts we considered two different neural architectures: a
Learning Vector Quantization (LVQ) net and a three-layer Multi-Layer Perceptron (MLP).
In order to train them, the training data was split into two disjoint sets: a training set
(in the following TRS) and a training-test set (in the following TTS), used to stop the
learning process so as to avoid possible overtraining. Different classifiers have been
tested, by varying the number of hidden nodes for the MLP nets and by using different
numbers of prototypes for the LVQ nets. Different feature sets have also been used for
training.

Different combining rules have been taken into account in order to build several
MES, each one composed of two or three of the previously described neural
architectures. In particular, both "fixed" (Majority Vote, Min, Max, Average) and
"trainable" (Naive Bayes, Dempster-Shafer, BKS) combining rules have been considered.
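
As an illustration of the "fixed" rules named above, the following sketch combines the outputs of several experts. It assumes each expert emits one score per class (for example, a normalised activation); this representation and the function name `combine` are hypothetical choices, not details given in the text. The trainable rules are not covered.

```python
def combine(outputs, rule):
    """Fuse per-class scores from several experts and return the winning class index.

    outputs: list with one score list per expert, all of the same length.
    rule: one of "majority", "min", "max", "average".
    """
    n_classes = len(outputs[0])
    if rule == "majority":
        # Each expert votes for its top-scoring class.
        fused = [0] * n_classes
        for scores in outputs:
            fused[scores.index(max(scores))] += 1
    else:
        aggregate = {
            "min": min,
            "max": max,
            "average": lambda values: sum(values) / len(values),
        }[rule]
        # Aggregate the experts' scores class by class.
        fused = [aggregate([scores[c] for scores in outputs]) for c in range(n_classes)]
    return fused.index(max(fused))
```

Ties are broken in favour of the lowest class index, which is a simplification of this sketch.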

The results obtained by all the considered classification systems are reported in
terms of i) the overall classification error, ii) the sum of the false alarm and the
missed detection rates, without taking into account class confusion among attack
classes, and iii) the average classification cost calculated according to the cost matrix
shown in Table 1. This type of evaluation has been proposed in [7] so as to have a
more significant parameter for evaluating the effectiveness of an IDS. The average
cost is computed by multiplying each entry of the confusion matrix by the corresponding
entry in the cost matrix, summing the products and dividing the result by the total
number of test samples. The cost matrix of Table 1 is also used for evaluating the
reliability thresholds by using the method illustrated in Section 2.1. In this case the
cost of a reject is assumed to be 0.5.
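
The average cost computation can be sketched as below. The cost values are those of Table 1 (rows: actual class, columns: guessed class); the function name and the example confusion matrix are hypothetical and serve only to show the arithmetic.

```python
CLASSES = ["Normal", "DoS", "U2R", "R2L", "Probe"]

# Cost matrix of Table 1: COST[i][j] is the cost of assigning a pattern of
# actual class CLASSES[i] to guessed class CLASSES[j].
COST = [
    [0, 2, 2, 2, 1],
    [2, 0, 2, 2, 1],
    [3, 2, 0, 2, 2],
    [4, 2, 2, 0, 2],
    [1, 2, 2, 2, 0],
]


def average_cost(confusion, cost=COST):
    """Sum of confusion[i][j] * cost[i][j] divided by the total number of samples."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / total
```

For instance, with 90 normal connections classified correctly, 5 normal connections labelled as Probe (cost 1 each) and 5 R2L attacks labelled as Normal (cost 4 each), the average cost is (5 + 20) / 100 = 0.25.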

Table 1. Cost matrix used to weight the confusion matrix related to each classifier and to
calculate the reliability thresholds. The cost of a reject is assumed to be 0.5.

                             Guess Class
Actual Class    Normal    DoS    U2R    R2L    Probe
Normal             0       2      2      2       1
DoS                2       0      2      2       1
U2R                3       2      0      2       2
R2L                4       2      2      0       2
Probe              1       2      2      2       0


In the following we present the results obtained by our classification method applied
to two different network services (ftp and http) among those present in the DARPA
database. Other services have also been tested, but the results obtained are not
reported here for the sake of brevity. The choice of designing a different multi-stage
classifier system for each service follows the so-called modular approach presented in
[10], where the authors experimentally demonstrate the advantage, in terms of
recognition performance, of an IDS that develops a different classification module for
each one of the network services to be protected.



3.1 FTP Service 

In this case the training data was made up of 798 patterns related to different attacks
and to the normal class. These samples are not well balanced among the different
classes: in particular, there are very few samples of U2R and Probe attacks. For this
reason such samples have been duplicated before training, giving rise to 841 training
samples. This procedure is motivated by the fact that the code for launching some types
of attack can easily be retrieved on the Web, so it is reasonable to expect several
identical connections in case of attack. We use 65% of the training data as TRS (549)
and the remaining 35% as TTS (292). The Test Set (TS) for this service is made up of
825 patterns.

Table 2 reports the results of the best single classifiers and of the best parallel MES
made up of three experts. As is evident, the results in terms of overall error are very
poor, while the percentage of false alarms and missed detections is quite acceptable.
In this case the best MES does not significantly improve the performance obtained by
the best single classifier.

Table 3 shows the characteristics of each stage of the proposed architecture. All
the stages employ an LVQ net as expert, trained in the same way as the single
classifiers described above. Note that different feature sets have been selected at
different stages.

Table 2. Results obtained by the best single experts and by the best Parallel MES on the TS.

Classifier                      Overall error    Cost      False+Missed alarm rates
Single   LVQ (all features)     13.82 %          0.3358    3.76 %
Single   LVQ (six features)     14.18 %          0.2861    1.09 %
MES      Average                13.70 %          0.2564    1.05 %
MES      Majority Vote          13.70 %          0.2564    1.05 %

Table 3. Details about each stage of the proposed system.

Stage   Neural Architecture   Features                        Feature Normalization   Threshold
I       LVQ                   duration, src_bytes,            Yes                     $\sigma_{I} = 0.406$
                              dst_bytes, same_srv_rate,
                              dst_host_srv_diff_host_rate
II      LVQ                   All                             Yes                     $\sigma_{II} = 0.210$
III     LVQ                   All                             No                      $\sigma_{III} = 0.100$
II'     LVQ                   All                             Yes                     $\sigma_{II',1} = 0.001$
                                                                                      $\sigma_{II',2} = 0.003$


The results obtained on the TS by the proposed system are reported in Table 4. For
the sake of comparison, the system is considered both with and without the reject
option. The overall error is significantly lower than that of the systems in Table 2
(2.79 % without reject against 13.70 % for the best MES). Moreover, the false alarm rate
further decreases with the introduction of the reject option, as expected.

It is worth noting that the chosen activation sequence for the stages of the proposed
architecture is indeed the most effective one. In fact, the overall error, without
the reject option, rises to 9.45% if Stage II is chosen as the first one in our
architecture, and to 8.36% if Stage III is the first to be activated.



Table 4. Results obtained on the TS by the proposed multi-stage classification system, with and
without the reject option.

                         Overall error                     Cost      False+Missed alarm rates
Without reject option    2.79 %                            0.0509    0.85 %
With reject option       0.72 % (with 3.76 % of reject)    0.0309    0.24 %

3.2 HTTP Service

The training data for this service in the DARPA database are made up of 64292 patterns.
However, in [9] it has been demonstrated that a dataset of about 15% of the whole http
data is sufficient to train neural classifiers. Therefore, only 8866 samples have been
considered as training data. Differently from the previous case, there are no attacks
belonging to the U2R class. This implies that the proposed multi-stage classification
system for this service does not include Stage III. Also in this case, samples of the
R2L and Probe classes have been duplicated before training. 70% of the whole training
data has been used as TRS and the remaining 30% as TTS. The TS for this service is made
up of 40442 patterns.

Table 5 reports the results of the best single classifiers and of the best parallel MES
made up of three experts. In this case the performance is quite good, especially as
regards the number of false alarms. The best MES exhibits exactly the same performance
as the best single expert.

Table 5. Results obtained by the best single experts and by the best Parallel MES on the TS.

Classifier                 Overall error    Cost      False+Missed alarm rates
Single   LVQ               0.22 %           0.0027    0.09 %
Single   MLP               0.27 %           0.0033    0.14 %
MES      Average           0.22 %           0.0027    0.09 %
MES      Majority Vote     0.22 %           0.0027    0.09 %



Table 6 shows the characteristics of each stage of the proposed architecture; all of
them employ an LVQ net as expert. Once again, these experts were trained in the same
way as the single classifiers reported in Table 5. In this case, the feature set is the
same for all three stages, although in one case no normalization was performed. Since
the expert present at each stage is very reliable, no thresholds were fixed by the
proposed method in this case.



Table 6. Details about each stage of the proposed system.

Stage   Neural Architecture   Features   Feature Normalization   Threshold
I       LVQ                   All        Yes                     0.00
II      LVQ                   All        Yes                     0.00
II'     LVQ                   All        No                      0.00



Finally, Table 7 shows the results of the proposed system on the TS. Also in this
case the overall error decreases with respect to the systems of Table 5 (from 0.22 % to
0.11 %), while the false alarm rate remains low. This confirms the effectiveness of the
proposed approach.

Table 7. Results obtained by the proposed multi-stage classification system on the TS.

Overall error    Cost      False+Missed alarm rates
0.11 %           0.0014    0.09 %

4 Conclusions

In this paper we have presented a multi-stage classification system for network intru- 
sion detection. For this architecture, we have used some criteria for evaluating the 
reliability of the response and a method for the determination of an optimal reject 
option, in order to reduce the false alarm rate of the system. The effectiveness of our 
proposal has been experimentally evaluated on a standard database, where a significant
improvement in the reliability of the system has been demonstrated.

Abstract. Leo Breiman’s Random Forest ensemble learning procedure
is applied to the problem of Quantitative Structure-Activity Relationship
(QSAR) modeling for pharmaceutical molecules. This entails using
a quantitative description of a compound’s molecular structure to predict
that compound’s biological activity as measured in an in vitro assay.
Without any parameter tuning, the performance of Random Forest with
default settings on six publicly available data sets is already as good as or
better than that of three other prominent QSAR methods: Decision Tree,
Partial Least Squares, and Support Vector Machine. In addition to reliable
prediction accuracy, Random Forest provides variable importance
measures which can be used in a variable reduction wrapper algorithm.
Comparisons of various such wrappers and between Random Forest and
Bagging are presented.



1 Introduction 

In the drug discovery process, candidate drug compounds are assayed for po- 
tency against disease targets; but they must also be assayed for toxicity and 
pharmacokinetic properties involving absorption, distribution, metabolism, and 
excretion (ADME). These biological activities are usually measured in in vitro
bioassays. The activities could be continuous numbers (like drug concentration 
producing 50% inhibition) or categorical labels (like active/inactive). We would 
like to model the relationship between a compound’s biological activity and its 
molecular structure, where the latter is characterized quantitatively by a set of 
topological descriptors. This problem is an example of Quantitative Structure- 
Activity Relationship (QSAR) modeling [Eki00,Haw01]. The prediction of a con- 
tinuous response is equivalent to regression; that of a categorical response is 
equivalent to classification. 

The number of descriptors, p, usually exceeds the number of samples, n, and
many of the descriptors may be irrelevant to predicting the activity of interest.
Traditional statistical methods (multiple linear regression [MLR], linear
discriminant analysis [LDA], and k-nearest neighbors [kNN]) cannot be used reliably
without a sophisticated variable selection filter, such as, for example, a genetic
algorithm. This approach is indeed taken by some investigators. Other approaches
found in the literature include Decision Tree (recursive partitioning [RP]),
Partial Least Squares (PLS), artificial neural networks (ANN), and support vector
machines (SVM), although ANN and SVM again often require variable pre-selection
if there are a large number of irrelevant descriptors. The Decision Tree
is relatively free of the aforementioned limitations; however, it suffers from low
accuracy. Ensembles of trees are a natural choice for QSAR modeling, since they
combine the desirable properties of Decision Trees with high prediction performance.
Among the most promising ensemble learning methods are boosting and
Random Forest, and in this paper we study Random Forest, developed by Leo
Breiman [Bre01]. We are currently investigating boosting, which will be the topic
of a future report.

F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 334-343, 2004.
© Springer-Verlag Berlin Heidelberg 2004

2 Random Forest 

The Algorithm. Like Bagging, Random Forest is an ensemble of unpruned
trees. Each tree is trained on a bootstrap sample of the training data, and
predictions are made by majority vote of the trees (in classification) or by averaging
their outputs (in regression). Random Forest differs from Bagging in that at
each node of each tree, the algorithm considers as splitting candidates a random
sample of the variables instead of all the variables. The size of the variable
subset is a fixed value, mtry, with default value $\sqrt{p}$ for classification and p/3
for regression. The idea is to maintain the “strength” of the trees while reducing
their correlation with each other. Breiman [Bre01] has shown that an upper
bound on the generalization error of Random Forest is given by $r(1 - s^2)/s^2$,
where r is a measure of the correlation between the trees, and s is a measure
of their strength (see [Bre01] for the details). Since the unpruned trees are
low-bias, high-variance models, averaging over an ensemble of trees reduces variance
while keeping bias low (see below). It is also thought that an ensemble of trees
mitigates the semi-artificiality of the tree structure (hyper-rectangular partition
of the descriptor space) and the greediness of the tree-growing algorithm, which
are arguably the two drawbacks of the Tree approach.
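
The per-node variable sampling that distinguishes Random Forest from Bagging can be sketched as follows. The rounding (floor, with a minimum of 1) is an assumption of this sketch that mirrors common implementations; the text itself only gives the square root of p and p/3 as defaults. Function names are hypothetical.

```python
import math
import random


def default_mtry(p, task):
    """Default size of the random variable subset tried at each node."""
    if task == "classification":
        return max(1, math.floor(math.sqrt(p)))
    if task == "regression":
        return max(1, p // 3)
    raise ValueError("task must be 'classification' or 'regression'")


def candidate_variables(p, mtry, rng):
    """Draw, without replacement, the variable indices considered at one node."""
    return rng.sample(range(p), mtry)
```

Setting mtry equal to p makes every variable a candidate at every node, which reduces the procedure to Bagging.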

Random Forest also provides additional features that increase its utility for
QSAR modeling (see [Bre01, Sve03] for further discussion):

1. Out-of-bag predictions, useful for error estimation or threshold selection (see 
below for explanation); 

2. A measure of variable importance, useful for model interpretation; and 

3. A measure of intrinsic proximity between two compounds, useful for com- 
puting “neighbor molecules”. (Not discussed further here.) 

Out-of-Bag Predictions. Since each tree in the ensemble is grown on a bootstrap
sample of the data, the molecules left out of the bootstrap sample, the
“out-of-bag” (OOB) data, can be used as a legitimate test set for that tree.
On average, about one-third of the training data will be “out-of-bag” for a given
tree [Has01]. Consequently, each molecule in the training data will be left out of (on
average) about 1/3 of the trees in the ensemble; we can compile the predictions for each
molecule when it is “out-of-bag” and use these OOB predictions to estimate the
error rate of the full ensemble. This is similar to a cross-validation performance
estimate, but at a much lower computational cost.
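
The "one-third" figure is an approximation. A given sample is missed by all n draws of a bootstrap sample with probability (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows. The short sketch below checks this both analytically and by simulation; the function names are hypothetical.

```python
import math
import random


def oob_probability(n):
    """Probability that one given sample is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, rng):
    """Fraction of n samples left out of a single simulated bootstrap sample."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n
```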

Variable Importance. Random Forest’s variable importance measure is based
on the following heuristic. When a descriptor that contributes to prediction
accuracy is “noised up” (e.g., replaced with random noise), the accuracy of
prediction should noticeably degrade. On the other hand, if a descriptor is irrelevant,
“noising” it up should have little effect on the performance. A number of specific
variable importance measures were proposed by Breiman based on this heuristic
[Bre01]. We currently use an importance measure that he proposed recently to
remedy a subtle problem with the previous measures (Breiman, personal
communication). The new measure is computed as follows: first compute the OOB
error rate (or MSE) of each tree, and also compute the same for OOB data with
one variable permuted; take the difference between these. The new measure is
the mean difference (over all data) divided by the standard error of these
differences. This variable importance measure could be used to select a subset of the
most important descriptors, and partial dependence plots [Has01] can be produced
for each of these, to see the trend, e.g., whether increasing a descriptor’s
value tends to increase the biological activity (in regression), or probability of
activity (in classification, where probability is interpreted as proportion of votes
in favor of the majority class).
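
The final step of this importance measure, the mean difference divided by its standard error, can be sketched as below. The sketch assumes that the per-tree OOB errors with and without permutation of one variable are already available, and that the averaging runs over trees; the latter is an interpretation of the description, and the function name is hypothetical.

```python
import math
import statistics


def importance_score(oob_errors, permuted_errors):
    """Mean increase in OOB error after permutation, divided by its standard error.

    oob_errors[t] and permuted_errors[t] are the OOB error (or MSE) of tree t
    before and after permuting the variable of interest.
    """
    diffs = [perm - base for base, perm in zip(oob_errors, permuted_errors)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0.0:
        # Identical differences in every tree: the ratio is undefined, so return
        # 0 for no change and a signed infinity otherwise (a choice of this sketch).
        return 0.0 if mean_diff == 0.0 else math.copysign(math.inf, mean_diff)
    return mean_diff / std_err
```

A large positive score indicates that permuting the variable consistently hurts the trees; a score near zero suggests an irrelevant descriptor.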



3 Performance Assessment 

In [Sve03], we selected six publicly available ADME data sets for the assessment
of Random Forest in competition with two other popular QSAR methods, Recursive
Partitioning (RP) and Partial Least Squares (PLS). The performance was
evaluated by computing test set predictions from 5-fold cross-validation (CV),
repeated on 50 different partitions of the data; we report median accuracy rates
(percent of compounds correctly classified, in classification) and correlations
between actual and predicted value (in regression). We review these results here
and include new results from Support Vector Machine (SVM) using both a linear
kernel and a radial-basis function (RBF) kernel with optimized parameters.

Briefly, the five classification data sets are the following: BBB (blood-brain
barrier permeability) [Don02], ER (estrogen receptor binding activity) [Ton03],
P-gp (P-glycoprotein transport activity) [Pen02], MDRR (multidrug resistance
reversal activity) [Bak00], and COX-2 (inhibition of cyclooxygenase-2) [Kau01].
The COX-2 data were assigned to categorical labels based on a cutoff on numerical
measurements of their log(IC50); direct prediction of these numerical
measurements was also done as a regression problem. (The number of molecules
is greater in the classification case because there were some molecules whose
activity was known only qualitatively, not quantitatively.) Another regression
data set, D2, was related to predicting the log(IC50) for binding affinity to the
dopamine D2 receptor [Gil92]. Please refer to [Sve03] for further details of these
data sets, such as the types of descriptors used. The results of the performance
assessments are shown in Tables 1 and 2. The results show, over a range of data
sets, that Random Forest at its default settings is already as good as or better
than the other competing methods. Parameter tuning will be discussed in the
next section.

Table 1. Median accuracy rates for classification test data from 50 5-fold
cross-validations (n = number of compounds; p = number of descriptors)

Data     n      p       RF       RP       PLS      SVM (linear)   SVM (RBF)
BBB      325    9       0.809    0.737    0.691    0.735          0.782
ER       232    197     0.828    0.759    0.813    0.765          0.815
P-gp     186    1522    0.806    0.712    0.769    0.801          0.804
MDRR     528    342     0.830    0.780    0.821    0.813          0.831
COX-2    314    135     0.780    0.736    0.774    0.774          0.774

Table 2. Median correlation between predicted and actual values for regression test
data from 50 5-fold cross-validations (n = number of compounds; p = number of
descriptors)

Data     n      p       RF       RP       PLS      SVM (linear)   SVM (RBF)
COX-2    272    135     0.658    0.525    0.573    0.608          0.646
D2       116    374     0.698    0.504    0.658    0.696          0.708

For the P-gp data set, we used a set of 1522 in-house generated atom pair
descriptors. As a “stress test” of the Random Forest algorithm, we also generated
an in-house set of 43,928 three-dimensional fingerprint descriptors for the same
data, and were able to obtain a median CV accuracy of 0.80 (although the
calculations required considerably more time).

4 Parameter Optimization and Variable Reduction 

Random Forest has a parameter, mtry, that could be considered a tuning parameter.
In [Sve03] we showed empirically that the performance of Random Forest
using a fixed set of descriptors is often relatively insensitive to the choice of mtry,
specified as a function of the number of descriptors (like mtry = $\sqrt{p}$), over a
large range of choices, as long as mtry is far from its minimum or maximum
possible values (1 or p, respectively). However, the sensitivity of the algorithm
to the choice of mtry may depend on the proportion of irrelevant variables in
the training data. Two other parameters in Random Forest are the number
of trees and the minimum node size. The number of trees should simply be
sufficiently large that the ensemble statistic of interest has stabilized. For instance,
if one is interested in accuracy rates and variable importance measures, in the
P-gp data set, we found that 1000 trees is adequate. However, if one is interested
in the proportion of votes for a given class, we found that we have to grow at
least 10,000 trees to get stable results. As for the minimum node size, which is
the minimum size of nodes below which no split will be attempted, in classification
this has a default value of 1 (allowing trees of full depth to be grown)
and in regression the default is 5. The latter value is somewhat arbitrary, but the
performance of Random Forest is relatively insensitive to changes in its value
(as long as it is kept small).

Variable reduction based on Random Forest’s variable importance measure is 
another potential way to optimize the Random Forest algorithm. The removal of 
irrelevant variables may improve the performance of the algorithm upon retrain- 
ing and may help improve the interpretability of the model. As an illustration, 
we implemented the following variable reduction “wrapper” algorithm: 

1. Partition the data for 5-fold cross-validation (CV). 

2. On each CV training set, train a model on all variables and use the variable 
importance measure to rank them. Record the CV test set predictions. 

3. Use the variable ranking to remove the least important half of the variables 
and retrain the model, predicting the CV test set. Repeat removal of half of 
the variables until there are about 2 left. 

4. Aggregate results from all 5 CV partitions and compute the error rate (in 
classification) or mean squared error (in regression) at each step of halving. 

5. Replicate steps (1)-(4) 20 times to “smooth out” the variability.

It is vital to note that this procedure is non-recursive; that is, on each training 
run of CV, the variable importance is calculated just once, at the beginning, 
and is not recalculated repeatedly as the variables are reduced. (A recursive 
version of this procedure is much greedier and, in our experience, has much 
worse performance.) Note also that the CV error, and not the OOB error, is 
used to assess performance in the wrapper algorithm. (In the next section, we 
will show that the use of OOB data for this purpose leads to severe overfitting.) 
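
The non-recursive character of the wrapper can be sketched as follows: the variables are ranked once, and every reduced model simply uses a prefix of that fixed ranking. Rounding up when halving and stopping at three variables are assumptions chosen so that the schedule for 1522 descriptors matches the axis of Fig. 1; the function names are hypothetical.

```python
def halving_schedule(p, min_vars=3):
    """Numbers of variables retained at each step when repeatedly halving p."""
    sizes = [p]
    while True:
        nxt = -(-sizes[-1] // 2)  # ceiling of half the current size
        if nxt < min_vars or nxt == sizes[-1]:
            break
        sizes.append(nxt)
    return sizes


def nonrecursive_subsets(ranking, min_vars=3):
    """Variable subsets for each step, all taken from one fixed importance ranking.

    ranking: variable identifiers sorted from most to least important, computed
    once on the CV training set with all variables present.
    """
    return [ranking[:k] for k in halving_schedule(len(ranking), min_vars)]
```

A recursive variant would instead recompute the ranking after each halving, which is the procedure the text advises against.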

On the P-gp data, the median error rates for the 50 replications, with medians
connected by line segments, are shown in Fig. 1 for various choices of mtry. The
cases of mtry equaling p (equivalent to Bagging), p/2, p/4, and the default $\sqrt{p}$
are considered. The plot shows that the default mtry, $\sqrt{p}$, performs the best,
but the other choices are only a few percent worse. Also, the performance remains
about the same as irrelevant variables are removed, until reaching 191 variables;
further variable reduction degrades it.

The robustness of Random Forest’s performance to the presence of irrelevant 
variables is illustrated here. It is certainly possible to improve the performance 
by a small amount by tuning parameters or reducing variables. For instance, 
in the COX-2 data set, we found that Bagging outperforms Random Forest by 
a couple percent [Sve03]. We have never yet observed a case where the perfor- 
mance actually improves as variables are reduced. This, and the relatively small 
change in performance due to different choices of mtry, show that Random For- 
est can perform reasonably well “off the shelf” without a lot of tuning or variable 
reduction. 

Fig. 1. Median CV test error rates at each step of halving the important variables,
using different mtry functions (Bagging, p/2, p/4, sqrt(p)), for the P-gp data. Line
segments connect the medians of 20 5-fold CV error rates. Horizontal axis: number of
variables (3, 6, 12, 24, 48, 96, 191, 381, 761, 1522).

Ambroise and McLachlan [Amb02] and Reunanen [Reu03] have argued compellingly
that the assessment of performance of a learning machine with variable
selection requires great care. Both the variable selection and the supervised training
should be embedded within a CV procedure in order to obtain an honest
assessment of the total learning system’s performance. If the variable selection
is done outside of CV, a selection bias is introduced. Following this principle,
a performance assessment of our procedure requires that the entire algorithm
be nested within another CV loop [Reu03]. For the P-gp data set, the median
nested CV error rate based on 10 replications of 5-fold CV embedded within 10
replications of 10-fold CV is 0.191. This performance is actually quite consistent
with Fig. 1, indicating that the selection bias is negligible in this example.
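
The nesting principle can be sketched as follows: the outer test fold is never seen by the inner loop that performs variable selection and training. The callbacks `select_and_fit` and `error_rate` are hypothetical placeholders for the wrapper and the scoring step, and replications of the whole scheme are omitted for brevity.

```python
import random


def kfold(indices, k, rng):
    """Yield (train, test) index lists for a shuffled k-fold partition."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def nested_cv(n, outer_k, inner_k, select_and_fit, error_rate, seed=0):
    """Average outer-fold error with selection and training confined to outer-train data."""
    rng = random.Random(seed)
    errors = []
    for outer_train, outer_test in kfold(range(n), outer_k, rng):
        # Inner CV splits are built only from the outer training indices.
        inner_splits = list(kfold(outer_train, inner_k, rng))
        model = select_and_fit(outer_train, inner_splits)
        errors.append(error_rate(model, outer_test))
    return sum(errors) / len(errors)
```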

5 Comparison with Other Wrappers 

At first glance, a reasonable alternative to the variable reduction procedure out- 
lined above is one that is exactly the same, except that the variable importance 
is recalculated at step 3, producing a new ranking of the variables. This corre- 
sponds to a recursive feature elimination procedure similar to the one used by 
Guyon et al. [Guy02]. Our limited experience shows that this approach performs
poorly, since it is more prone to overfitting than the non-recursive approach. A
comparison of the performance of these two approaches is shown in Fig. 2 for the
P-gp data and Fig. 3 for the D2 data. As variable reduction proceeds,
the recursive approach produces much higher error rates, especially in the D2
case, than the non-recursive approach. Because of its inferior performance, we
recommend avoiding the recursive procedure.

Fig. 2. Comparisons of the medians of CV test error rate performance of our wrapper
variable selection procedure and a recursive version of it (P-gp data). The traces
correspond to ten replications of 5-fold cross-validation

Fig. 3. Comparisons of the medians of CV test error rate performance of our wrapper
variable selection procedure and a recursive version of it (D2 data). The traces
correspond to ten replications of 5-fold cross-validation. Horizontal axis: No. of AP
Descriptors

Another variable reduction wrapper that could be attempted is one that
uses the Random Forest OOB prediction error, rather than the CV error, to
assess performance in step 4 of the wrapper algorithm. We tried this with the
P-gp data with the class labels randomly scrambled. We compared the CV test
set error and the OOB prediction error during variable selection (see Fig. 4).
The CV test set error does reflect what we expect: performance close to that
of random guessing, regardless of how many variables are used. On the other
hand, the OOB prediction error rate incorrectly suggests that performance can
be improved substantially by doing variable reduction. This shows that the use
of out-of-bag error estimates for iterative performance assessment lends itself
to severe overfitting. This is because the OOB error estimate for any reduced
number of variables is contaminated by the initial variable ranking, which was
based on all the data. The curve in Fig. 4 shows that with a sufficient number
of variables remaining, this wrapper can do a good job of fitting to noise.




Fig. 4. Medians of CV test set and out-of-bag prediction error rates at each step of 
halving the important variables (P-gp data with randomized response vector). Line 
segments connect the medians of 20 5-fold CV error rates 

6 Conclusion 

Random Forest, as a tree ensemble learning method, has a combination of desir- 
able properties for QSAR modeling and excellent prediction performance when 
used “off the shelf” (i.e., without parameter tuning or variable selection). We 
also showed how to use a variable selection wrapper with Random Forest, and 
how this wrapper is superior to two other possible wrappers. These results
convince us that tree ensemble methods are very promising for QSAR modeling in
the drug discovery process, and we continue to explore Random Forest and other
methods like Boosting for these applications.

7 Software 

Open source software for Random Forest is publicly available. The Fortran
code for the Random Forest software, written by L. Breiman and A. Cutler,
is found at http://www.stat.berkeley.edu/users/breiman/RandomForests/
and an R interface for it by A. Liaw and M. Wiener [Lia02] can be found at
http://cran.us.r-project.org/ by looking for the randomForest package.
The SVM software we used is from SPIDER, found at
http://www.kyb.tuebingen.mpg.de/bs/people/spider/.

Abstract. Multiple classifier systems provide an effective way to improve pattern
recognition performance. In this paper, we use multiple classifier combination
to improve LDA for high dimensional data classification. When dealing
with high dimensional data, LDA often suffers from the small sample size
problem and the constructed classifier is biased and unstable. Although some
approaches, such as PCA+LDA and Null Space LDA, have been proposed to
address this problem, they all come at the cost of discarding some useful
discriminative information. We propose an approach to generate multiple Principal Space
LDA and Null Space LDA classifiers by random sampling on the feature vector
and training set. The two kinds of complementary classifiers are integrated to
preserve all the discriminative information in the feature space.



1 Introduction 

Multiple classifier combination is an effective way to improve pattern recognition 
performance. Random subspace [4] and bagging [5] are two popular techniques to 
combine weak classifiers into a powerful decision rule. In the random subspace 
method, a set of low dimensional subspaces are generated by randomly sampling 
from the high dimensional feature vector and multiple classifiers constructed in the 
random subspaces are combined in the final decision. In bagging, random independ- 
ent bootstrap replicates are generated by sampling the training set. A classifier is 
constructed from each replicate, and the results of all the classifiers are finally inte- 
grated. Based on the two random sampling techniques, we propose an approach using 
multiple LDA classifier combination for high dimensional data classification. 

Linear Discriminant Analysis (LDA) is a popular feature extraction technique for 
data classification. It determines a set of projection vectors maximizing the 
between-class scatter matrix (S_b) and minimizing the within-class scatter matrix (S_w) in the 
projective feature space. But when dealing with high dimensional data, LDA 
often suffers from the small sample size problem. When there are not enough training 
samples, S_w is not well estimated and may become singular [3]. 

To address this problem, a two-stage PCA+LDA approach [1] has been proposed. The 
high dimensional data is first projected to a low dimensional PCA subspace, in which 
S_w is non-singular, and then LDA is performed. We call it Principal Space LDA. 
The eigenvectors with small eigenvalues removed from the PCA subspace may also 
encode some information helpful for recognition. Their removal may introduce a loss 
of discriminative information. 

Chen et al. [2] suggested that the null space spanned by the eigenvectors of S_w 
with zero eigenvalues contains the most discriminative information. However, as 
explained in [2], with the existence of noise, when the training sample number is 
large, the null space of S_w becomes small, so much discriminative information 
outside this null space will be lost. 

Some random sampling based LDA classification approaches can be found in 
[7][8]. Different from the previous work, our method simultaneously samples on the 
feature space and training samples, and takes advantage of the discriminative 
information in both the principal and null spaces of S_w. We also explain that both 
Principal Space LDA (P-LDA) and Null Space LDA (N-LDA) encounter the overfitting 
problem, but for different reasons. So we improve them in different ways 
accordingly. A more detailed description of the algorithm can be found in [9][10]. In this 
paper, we make an extensive experimental study on the XM2VTS database [12]. 

2 LDA for High Dimensional Data Classification 

Two conventional LDA approaches, PCA+LDA and N-LDA, are briefly reviewed in 
this section. The high dimensional data is represented as a vector x with length N. 
The training set contains M samples belonging to L classes. 



2.1 PCA+LDA 



Principal Component Analysis (PCA) computes a set of eigenvectors of the ensemble 
covariance matrix C of the training set. Eigenvectors are sorted by eigenvalues, which 
represent the variance of data distribution. There are at most M-1 eigenvectors with 
non-zero eigenvalues. Normally K eigenvectors, U = [u_1, ..., u_K], with the largest 
eigenvalues, are selected to span the PCA subspace. Low dimensional features are 
extracted by projecting the high dimensional data x into the PCA subspace, 

    w = U^T (x - m),    (1) 

where m is the mean of the training set. 
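
The projection in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `pca_project`, the random test data, and the use of an SVD of the centred data (instead of forming the covariance matrix C explicitly) are choices made here, not details taken from the paper.

```python
import numpy as np

def pca_project(X, K):
    """Project the rows of X (M samples x N features) onto the K leading eigenvectors."""
    m = X.mean(axis=0)                      # mean of the training set
    Xc = X - m
    # Rows of Vt are eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:K].T                            # N x K basis of the PCA subspace
    W = Xc @ U                              # each row is w = U^T (x - m)
    return W, U, m, s

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))               # hypothetical data: M = 10 samples, N = 50
W, U, m, s = pca_project(X, K=5)
n_nonzero = int(np.sum(s > 1e-10))          # number of directions with non-zero variance
```

With M = 10 centred samples, at most M-1 = 9 singular values are non-zero, which matches the bound stated above.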

LDA tries to find a set of projecting vectors W maximizing the ratio of the determinant 
of S_b to the determinant of S_w, 

    W = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|}.    (2) 

W can be computed from the eigenvectors of S_w^{-1} S_b [6]. The rank of S_w is at most 
M-L. But when the training set is small and M-L is smaller than the vector length N, 
S_w may become singular and it is difficult to compute S_w^{-1}. 
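
The rank bound on S_w can be checked numerically. The sketch below is an illustration only: the helper `scatter_matrices`, the synthetic Gaussian data, and the sizes (4 classes, 3 samples each, N = 20) are assumptions chosen so that M-L < N.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return the between-class (Sb) and within-class (Sw) scatter matrices."""
    m = X.mean(axis=0)
    N = X.shape[1]
    Sb = np.zeros((N, N))
    Sw = np.zeros((N, N))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = (mc - m)[:, None]
        Sb += len(Xc) * (d @ d.T)           # spread of class means around the global mean
        Sw += (Xc - mc).T @ (Xc - mc)       # spread of samples around their class mean
    return Sb, Sw

rng = np.random.default_rng(0)
n_classes, per_class, N = 4, 3, 20          # hypothetical sizes: M = 12, L = 4
X = rng.normal(size=(n_classes * per_class, N))
y = np.repeat(np.arange(n_classes), per_class)
Sb, Sw = scatter_matrices(X, y)
rank_Sw = int(np.linalg.matrix_rank(Sw, tol=1e-8))
```

Here rank_Sw equals M-L = 8, which is below N = 20, so S_w is singular and cannot be inverted directly.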

In the two-stage PCA+LDA approach [1], the data vector is first projected to a 
PCA subspace spanned by the M-L largest eigenvectors. LDA is then performed in 
the M-L dimensional subspace, such that S_w is nonsingular. But in many cases, M-L 
dimensionality is still too high for the training set. So the LDA classifier is often 
biased and unstable. Furthermore, much discriminative information outside the PCA 
subspace is discarded. 

2.2 Null Space LDA 

Chen et al. [2] suggested that the null space of S_w also contains much discriminative 
information. It is possible to find some projection vectors W satisfying W^T S_w W = 0 
and W^T S_b W ≠ 0, so that the Fisher criterion in Eq. (2) reaches its maximum 
value. The rank of S_w, r(S_w), is bounded by min(M-L, N). Because of the existence 
of noise, r(S_w) is almost equal to this bound. The dimension of the null space 
is max(0, N-M+L). As shown by experiments in [2], when the training sample 
number is large, the null space of S_w becomes small, thus much discriminative 
information outside this null space will be lost. 
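
As a rough illustration of the null space idea, the following sketch computes a basis of the null space of S_w and then picks the directions inside it that maximise the between-class scatter. The synthetic data, the eigenvalue threshold 1e-8, and all variable names are assumptions made for this example; it is not the implementation of [2].

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, per_class, N = 4, 3, 20           # hypothetical sizes: M = 12, L = 4
M = n_classes * per_class
X = rng.normal(size=(M, N))
y = np.repeat(np.arange(n_classes), per_class)

means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
Xw = X - means[y]                            # samples centred on their class mean
Sw = Xw.T @ Xw
Db = means - X.mean(axis=0)
Sb = per_class * (Db.T @ Db)                 # equal class sizes assumed

vals, vecs = np.linalg.eigh(Sw)
Q = vecs[:, vals < 1e-8]                     # basis of the null space of S_w
null_dim = Q.shape[1]

# Inside the null space, keep the L-1 directions with the largest between-class scatter.
bvals, bvecs = np.linalg.eigh(Q.T @ Sb @ Q)
W = Q @ bvecs[:, ::-1][:, :n_classes - 1]
within = float(np.abs(W.T @ Sw @ W).max())   # numerically zero
between = np.diag(W.T @ Sb @ W)              # strictly positive
```

In this toy setting the null space has dimension N-M+L = 12, and the selected vectors satisfy W^T S_w W = 0 with W^T S_b W ≠ 0 up to rounding.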



3 Multiple LDA Classifier Combination 
for High Dimensional Data Classification 

Both P-LDA and N-LDA face the same problem: the constructed classifier is unstable 
and much discriminative information is discarded. The causes, however, differ between 
the two methods. So we design different random sampling algorithms to improve the two LDA 
methods, and combine them in a multiple classifier structure. 



3.1 Using Random Subspace to Improve P-LDA 

In P-LDA, overfitting happens when the training set is relatively small compared to 
the high dimensionality of the feature vector. In order to construct a stable LDA 
classifier, we sample a small subset of features to reduce the discrepancy between the 
training set size and the feature vector length. Using such a random sampling method, we 
construct multiple stable LDA classifiers, and combine them into a powerful 
classifier covering the entire feature space without losing discriminative 
information. 

We first apply PCA to the training set. All the eigenvectors with zero eigenvalues 
are removed, since all the training samples have zero projections on them. The M-1 
eigenvectors U_0 = {u_1, ..., u_{M-1}} with positive eigenvalues are retained as candidates 
to construct random subspaces. Then, K random subspaces are generated. The dimension 
of each random subspace is determined by the training set to make the LDA classifier 
stable. In each random subspace, the first N_0 dimensions are fixed as the N_0 largest 
eigenvectors, and the remaining N_1 dimensions are randomly selected from 
{u_{N_0+1}, ..., u_{M-1}}. The N_0 largest eigenvectors encode much data structural 
information. If they are not included in the random subspace, the accuracy of LDA 
classifiers may be too low. Our approach guarantees that the LDA classifier in each 
random subspace has satisfactory accuracy. The N_1 random dimensions cover most of 
the remaining small eigenvectors. So the ensemble classifiers also have a certain 
degree of error diversity. 
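
The subspace construction described above can be sketched as index sampling over the sorted eigenvectors. The function name `random_subspaces` is hypothetical, and the sizes (589 eigenvectors, 50 fixed, 50 random, 20 subspaces) are borrowed from the experiments in Section 4.1 purely for illustration.

```python
import numpy as np

def random_subspaces(n_eig, n_fixed, n_random, K, rng):
    """Return K index arrays into the eigenvector list (sorted by decreasing eigenvalue).

    Each array holds the n_fixed leading indices followed by n_random indices
    drawn without replacement from the remaining ones.
    """
    fixed = np.arange(n_fixed)
    rest = np.arange(n_fixed, n_eig)
    return [
        np.concatenate([fixed, np.sort(rng.choice(rest, size=n_random, replace=False))])
        for _ in range(K)
    ]

rng = np.random.default_rng(0)
subspaces = random_subspaces(n_eig=589, n_fixed=50, n_random=50, K=20, rng=rng)
```

Each index array would then select the columns of the eigenvector matrix that span one random subspace, in which a separate LDA classifier is trained.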

3.2 Using Bagging to Improve N-LDA 

In N-LDA, the overfitting problem happens when the training sample number is 
large, since the null space will be too small. It can be alleviated by bagging. In 
bagging, random independent bootstrap replicates are generated by sampling the training 
set, so each replicate has a smaller number of training samples. We generate K 
replicates by randomly sampling the training set. An N-LDA classifier is constructed from 
each replicate and the multiple classifiers are combined using a fusion rule. 
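
Generating the replicates amounts to drawing index sets from the training set. The sketch below is an assumption-laden illustration: the function name `bagging_replicates` is hypothetical, sampling with replacement is assumed because the text speaks of bootstrap replicates, and the sizes (590 samples, 300 per replicate, 20 replicates) are taken from Section 4.2.

```python
import numpy as np

def bagging_replicates(n_samples, replicate_size, K, rng, replace=True):
    """Return K arrays of training-sample indices, one per replicate."""
    return [rng.choice(n_samples, size=replicate_size, replace=replace) for _ in range(K)]

rng = np.random.default_rng(0)
replicates = bagging_replicates(n_samples=590, replicate_size=300, K=20, rng=rng)
```

Passing `replace=False` would instead draw distinct samples for each replicate; the text does not say which variant was used.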

3.3 Integrating Random Subspace and Bagging for LDA Based Classification 

While P-LDA is computed from the principal subspace of S_w, in which 
W^T S_w W ≠ 0, N-LDA is computed from its orthogonal subspace, in which 
W^T S_w W = 0. Both of them discard some discriminative information. Fortunately, 
the information retained by the two kinds of classifiers complements each other. So 
we combine them to construct the final classifier. Many methods for combining 
multiple classifiers have been proposed [11]. In this paper, we use two simple fusion 
rules: majority voting and sum rule. More complex combination algorithms may 
further improve the system performance. 
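
The two fusion rules can be written compactly as follows. This is a minimal sketch: the function names, the array layouts, and the convention that a larger score means a more likely class are assumptions, and ties in the vote are broken in favour of the lowest class label.

```python
import numpy as np

def majority_vote(labels):
    """labels: (K classifiers, n samples) integer array of predicted class labels."""
    n_classes = int(labels.max()) + 1
    # counts has shape (n_classes, n samples): votes received by each class per sample
    counts = np.apply_along_axis(np.bincount, 0, labels, minlength=n_classes)
    return counts.argmax(axis=0)

def sum_rule(scores):
    """scores: (K classifiers, n samples, n classes); larger means more likely."""
    return scores.sum(axis=0).argmax(axis=1)

labels = np.array([[0, 1, 2],
                   [0, 2, 2],
                   [1, 1, 2]])
voted = majority_vote(labels)

scores = np.array([[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
                   [[0.1, 0.7, 0.2], [0.1, 0.3, 0.6]]])
summed = sum_rule(scores)
```

Majority voting uses only the hard decisions, whereas the sum rule also uses how confident each classifier is.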

4 Experiments 

We apply the random sampling based LDA approach to face recognition and make an 
extensive experimental study on the XM2VTS face database [12]. There are 295 
people, and each person has four frontal face images taken in four different sessions. 
In our experiments, two face images of each class are selected for training, and the 
remaining two for testing. In preprocessing, the face image is normalized by 
translation, rotation, and scaling, such that the centers of two eyes are in fixed positions. A 
46 by 81 mask removes most of the background. So the face data dimension is 
46 x 81 = 3726. We adopt the recognition test protocol used in FERET [13]. All the 
face classes in the reference set are ranked. We measure the percentage of the 
“correct answer in top 1 match”. 
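
The top 1 match rate can be computed from a probe-to-reference distance matrix, as sketched below. The function name `top1_accuracy`, the toy distances, and the assumption that a smaller distance means a better match are illustrative choices, not details of the FERET protocol.

```python
import numpy as np

def top1_accuracy(dist, ref_labels, probe_labels):
    """dist[i, j]: distance between probe i and reference class j (smaller is better)."""
    nearest = dist.argmin(axis=1)            # best-ranked reference for each probe
    return float(np.mean(ref_labels[nearest] == probe_labels))

dist = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.3, 0.6, 0.5]])
ref_labels = np.array([0, 1, 2])
probe_labels = np.array([0, 1, 2])
acc = top1_accuracy(dist, ref_labels, probe_labels)

face_dim = 46 * 81                           # dimension of the masked face vector
```

In this toy example the third probe is matched to the wrong reference, so two of three probes are correct.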

4.1 Random Subspace LDA 

We first compare random subspace LDA with the conventional PCA+LDA approach. 
Table 1 reports the accuracy of a single P-LDA classifier constructed from PCA 
subspaces with different dimensions. Since there are 590 face images of 295 classes in the 
training set, there are 589 eigenfaces with non-zero eigenvalues. According to [1], the 
PCA subspace dimension should be M-L=295. However, the result shows that the 
accuracy is only 79% using a single P-LDA classifier constructed from 295 
eigenfaces, because this dimension is too high for this training set and S_w cannot be well 
estimated. We observe that the P-LDA classifier has the best accuracy, 92.9%, when the 
PCA subspace dimension is set at 100. So for this training set 100 seems to be a 
suitable dimension to construct a stable P-LDA classifier. In the following experiments, 
we choose 100 as the dimension of random subspaces to construct the multiple 
P-LDA classifiers. 

First, we generate the random subspaces by randomly selecting 100 eigenfaces 
from the 589 eigenfaces with nonzero eigenvalues. The result of combining 20 P-LDA 
classifiers using majority voting is shown in Figure 1. The accuracy of each 
individual P-LDA classifier is low, between 50% and 70%. Using majority voting, the weak 
classifiers are greatly strengthened, and 87% accuracy is achieved. This shows that 
P-LDA classifiers constructed from different random subspaces are complementary to 
each other. In Table 2, as we increase the classifier number K, the accuracy of the 
combined classifier improves, and even becomes better than the highest accuracy in 
Table 1. Although increasing the classifier number and using more complex combining 
rules may further improve the performance, doing so increases the computational burden 
of the system. 



Table 1. Recognition accuracy of PCA+LDA classifier constructed from PCA subspace with 
different dimension. 

Dim        30     50     70     100    150    200    250    295
Accuracy   0.870  0.925  0.927  0.929  0.898  0.864  0.820  0.792



Fig. 1. Recognition accuracy of combining 20 P-LDA classifiers constructed from random 
subspaces using majority voting. Each random subspace randomly selects 100 eigenfaces from 589 
eigenfaces with non-zero eigenvalues. 

Table 2. Accuracy of combining different number (K) of P-LDA classifiers constructed from 
random subspaces using majority voting. Each random subspace randomly selects 100 
eigenfaces from 589 eigenfaces with non-zero eigenvalues. 

K          20     40     60     80     100    120    140    160
Accuracy   0.871  0.907  0.917  0.922  0.937  0.932  0.939  0.939


Table 3. Recognition accuracy of P-LDA classifiers constructed from different parts of the 
eigenface sequence, which has been sorted by eigenvalues. The first row is the index of eigenfaces 
spanning the subspace from which the LDA classifier is constructed, and the second row is the 
recognition accuracy. 

Index      1-100  101-200  201-300  301-400  401-500  501-589  vote
Accuracy   0.929  0.514    0.378    0.148    0.06     0.04     0.613


Table 4. Recognition accuracy of combining P-LDA classifiers using different number (K) of 
random subspaces (sum rule). In each random subspace, the first 50 dimensions are fixed as 
the 50 largest eigenfaces, and another 50 dimensions are randomly selected from the remaining 
539 eigenfaces with positive eigenvalues. We run ten times on the same training set and testing 
set, and record the accuracy means and variances. 

K          5       10      15      20      25      30
Mean       0.954   0.958   0.959   0.961   0.961   0.962
Variance   0.0133  0.0127  0.0094  0.0101  0.0068  0.0049


The largest eigenfaces encode much face structural information. If they are not 
included in the random subspace, the individual LDA classifier is poor. This is 
further supported by Table 3, in which six LDA classifiers are constructed based on 
different parts of the eigenface sequence. The first row is the index of eigenfaces spanning 
the subspace. Using only the eigenfaces with small eigenvalues, the recognition 
accuracy of the LDA classifier is poor. But this does not mean these eigenfaces are not useful for 
recognition. 

A better approach to improve the performance of the combined classifier is to 
increase the accuracy of each individual weak classifier. To improve the accuracy of 
each individual P-LDA classifier, as described in Section 3.1, in each random 
subspace, we fix the first 50 basis vectors as the 50 largest eigenfaces, and randomly select 
another 50 basis vectors from the remaining 539 eigenfaces. As shown in Figure 2, individual 
P-LDA classifiers are improved significantly. Their accuracy is similar to that of the LDA 
classifier based on the first 100 eigenfaces. These classifiers are also complementary to each 
other, so much better accuracy (96%) is achieved when they are combined. The 
recognition performance using different numbers of random subspaces is shown in 
Table 4. We run 10 times on the same training set and testing set, recording the 
accuracy means and variances. Using more random subspaces, the accuracy is higher and 
more stable. 

We also apply random subspace to N-LDA. Similar to the method in Section 3.1, 
the random subspaces with dimension D (295<D<590) are generated from the PCA 
subspace and an N-LDA classifier is constructed from each random subspace. As shown 
in Figure 3, there is no improvement in recognition performance. When the random 
subspace dimensionality D is low, the null space dimension (D-295) is small, so the 
recognition accuracy drops greatly. Random subspace further reduces the null space 
dimension and aggravates the overfitting problem of N-LDA. 




Fig. 2. Recognition accuracy of combining 20 P-LDA classifiers constructed from random 
subspaces. For each 100 dimensional random subspace, the first 50 dimensions are fixed as the 50 
largest eigenfaces, and another 50 dimensions are randomly selected from the remaining 539 
eigenfaces with non-zero eigenvalues. 




Fig. 3. Recognition accuracy of combining 20 N-LDA classifiers from random subspaces with 
different dimensions using majority voting. 



4.2 Bagging LDA 

Figure 4 reports the performance of bagging based N-LDA. We generate 20 
replicates and each replicate contains 300 training samples. The individual N-LDA 
classifier constructed from each replicate is less effective than the original classifier trained 
on the full training set. This is because some intra-class variations are not 
included in each replicate. However, when the multiple classifiers are combined, the 
accuracy is significantly improved, and becomes much better than the standard 
N-LDA. Table 5 reports the performance of bagging based N-LDA using different numbers 
of replicates, with the training sample number in each replicate fixed at 300. As in 
Table 4, performance is more stable with a relatively large number of replicates. In Figure 5, we 
fix the number of bagging replicates at 20, but change the training sample number 
contained in the replicates from 100 to 500. The best performance is achieved using 
a moderate training sample number in each replicate. When the training sample 
number in each replicate is too small, the null space cannot effectively remove the 
intra-class variation. When the training sample number in each replicate is too large, 
the null space dimension is too small to contain enough discriminative information, 
and different replicates are similar. 



Table 5. Recognition accuracy of combining N-LDA classifiers using different number (K) of 
bagging replicates (sum rule). We run ten times on the same training set and testing set, and 
record the accuracy means and variances. 

K          5       10      15      20      25      30
Mean       0.929   0.934   0.942   0.956   0.951   0.961
Variance   0.0120  0.0109  0.097   0.009   0.036   0.027



We also study using bagging to improve P-LDA classifiers. The PCA subspace is 
spanned by the 100 largest eigenfaces and 20 replicates are generated. The accuracies 
with the replicate containing different numbers of people are shown in Figure 6. As 
expected, the combined classifier shows no improvement over the original P-LDA 
classifier. In each replicate, the P-LDA classifier is constructed from an even smaller 
number of training samples, which aggravates the small sample size problem. 



4.3 Integrating Random Subspace and Bagging Based LDA 

Integrating the multiple P-LDA classifiers generated by random subspace and 
N-LDA classifiers generated by bagging, the recognition accuracy can be further 
improved. We combine 10 P-LDA classifiers constructed from random subspaces and 
10 N-LDA classifiers constructed from bagging replicates, and obtain an even better 
result as shown in Table 6. 



Table 6. Comparison of random sampling based LDA with conventional LDA approaches. R-LDA 
(1): random subspace based LDA; R-LDA (2): bagging based N-LDA; R-LDA (3): integrating 
random subspace and bagging based LDA. 

PCA+LDA  N-LDA  R-LDA (1)  R-LDA (2)  R-LDA (3)
0.929    0.919  0.961      0.956      0.976

Fig. 4. Recognition accuracy of combining 20 N-LDA classifiers constructed from bagging 
replicates. 




Fig. 5. Recognition accuracy of combining 20 N-LDA classifiers with different number of 
training samples contained in the bagging replicates (sum rule). 




Fig. 6. Recognition accuracy of combining 20 P-LDA classifiers constructed from bagging 
replicates containing different number of training samples. The PCA space is spanned by 100 
largest eigenfaces. The combining rule is majority voting. 

5 Conclusion 

Both P-LDA and N-LDA encounter overfitting problems when dealing with 
high dimensional data classification, but for different reasons. So we improve 
them using different random sampling approaches: sampling on features for P-LDA 
and sampling on training samples for N-LDA. The two kinds of complementary 
classifiers are finally integrated in our system. The extensive experimental study on the 
XM2VTS database illustrates the effectiveness of our method and how it works. 

Abstract. We consider the problem of face verification using multichannel 
image data where each channel serves as the input to a separate face 
verification expert. By decorrelating the information content of the 
respective data channels, we enhance the diversity of the resulting face 
verification experts as well as the performance of the multiple classifier 
system. 



1 Introduction 

One of the keys to success in constructing multiple classifier systems is to ensure 
that the component classifiers can provide complementary information for the 
fusion process. Such classifiers are deemed to exhibit diversity which is beneficial 
not only from the point of view of reducing the variance of the combined decision 
rule, but may even lead to the reduction of the inherent ambiguity between class 
populations. 

There are a number of mechanisms that have been proposed to achieve this 
objective. In situations when the component classifiers are designed 
independently, their diversity can be maximised by the process of clustering [2, 6] which 
will help to identify experts with similar behaviour. The classifiers selected for 
fusion are picked as the representatives of the detected clusters. This procedure 
guarantees that the number of component classifiers used in the ultimate 
multiple classifier system is as small as possible, which in turn controls the number of 
degrees of freedom available to the fusion decision rule. 

The diversity of the component classifiers can be imposed more effectively 
by means of boosting [10] where the successive designs are based on resampled 
training sets which focus on the samples misclassified by the experts constructed 
earlier. This is a powerful concept but in practice its success depends on a careful 
trade-off between the performance of the component classifiers on the training 
set and their ability to generalise. 

One of the easiest ways to achieve diversity is to deploy multiple sensors 
which themselves gather complementary information about the objects to be 
recognised. In the context of personal identity authentication, multiple biometric 
modalities such as face, voice characteristics, iris and fingerprint each bring to bear 
a completely independent piece of evidence on the authentication problem. The 
fusion of such multimodal experts has been shown to enhance the authentication 
performance significantly. 

Some systems, such as a camera, are multiple sensor devices. Although they 
provide complementary information about the spectral properties of the imaged 
objects, the outputs of the respective sensor channels are not necessarily 
independent as they reflect not only the object’s colour but also its shape. Nevertheless, 
a number of studies have demonstrated that the decision level 
fusion of experts utilising such multispectral information can lead to performance 
improvement [11,5]. 

This paper also addresses the problem of face verification using colour im- 
ages. However, in contrast to the above work, here by analysing the underlying 
physical process of image formation, we shall show that the information content 
of the raw spectral images can be mapped into new image spaces which focus 
on complementary information content of the imaged scene. It will be demon- 
strated that by adopting the intensity image, intensity normalised green and 
opponent colour channels we will separate the imaging effects of object shape 
and object albedo, and create complementary image data channels that lead to 
face experts with an enhanced degree of diversity. The fusion of these experts 
will result in significant improvements in performance over the system in which 
the face experts work with the raw R,G,B channel data. 

The paper is organised as follows. In the next section we develop the physics 
based transformation that orthogonalises the R,G,B channel content. The face 
verification system used in the study is described in Section 3. The experimental 
set up is detailed in Section 4. Section 5 presents the results of the experiments. 
Finally, in Section 6 the results are discussed and the paper is drawn to a 
conclusion. 

2 Physics Based Multichannel Image Conditioning 

Let us consider an image acquisition system deploying a conventional colour 
camera. We assume that the imaged objects have a Lambertian surface. This 
assumption is reasonably well justified in the case of faces. We shall ignore the 
effect of interface reflection, which would distort only a small part of the face 
image due to the saturation of the camera. In any case, in regions giving rise 
to total reflection, the spectral content of the reflected light would be domi- 
nated by the illuminant, rather than the face skin, and would not provide useful 
information for discriminatory purposes. 

Suppose the scene is illuminated by a spatially invariant illumination source of 
spectral distribution e(λ) where λ represents the wavelength of the incident light. 
Under the Lambertian assumption, the light emitted from the scene will be a 
function of the material properties of the scene objects, albedo, which we denote 
by a(x, y, λ) and the relative angles between the direction of illumination and 
the normal to the surface patch imaged by the (x, y) pixel of a camera sensor. 
The effect of the geometry will be to scale down the incident light by a factor 
s(x, y). The output of the sensor at pixel position (x, y) will then be given by 

    I_k(x, y) = \int_{\lambda_1}^{\lambda_2} p_k(\lambda) a(x, y, \lambda) s(x, y) e(\lambda) d\lambda    (1) 

where p_k(λ) is the spectral response of the sensor and λ_i, i = 1, 2 are limits of 
the visible frequency spectrum. A typical colour camera will have three spectral 
channels, normally measuring the reflected light in the scene in the red, green 
and blue parts of the visible light spectrum. They are referred to as the R, G and 
B channels. 
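
A discretised version of Eq. (1) makes the role of the geometry factor explicit. Everything numerical in this sketch is an assumption chosen for illustration: the 400 to 700 nm wavelength grid, the box-shaped sensor responses, the flat illuminant and the linearly rising albedo.

```python
import numpy as np

lam = np.linspace(400.0, 700.0, 301)         # wavelength grid in nm (assumed limits)
dlam = lam[1] - lam[0]

def box(lo, hi):
    """Idealised sensor response: 1 inside [lo, hi), 0 elsewhere."""
    return ((lam >= lo) & (lam < hi)).astype(float)

p = np.stack([box(600, 700), box(500, 600), box(400, 500)])   # hypothetical R, G, B responses
e = np.ones_like(lam)                                          # flat illuminant
a = 0.2 + 0.6 * (lam - 400.0) / 300.0                          # hypothetical albedo, rising towards red

def channel_outputs(s):
    """Riemann-sum approximation of Eq. (1) for one pixel with geometry factor s."""
    return (p * a * s * e).sum(axis=1) * dlam

I_full = channel_outputs(1.0)
I_half = channel_outputs(0.5)
```

Halving s halves all three channel outputs together, which is why the raw R, G and B channels are strongly correlated through the scene geometry.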

The majority of face recognition and verification systems use the intensity 
image, which is obtained by summing up the outputs of the R, G, B channels, i.e. 

    I(x, y) = \int_{\lambda_1}^{\lambda_2} \sum_{k=1}^{3} p_k(\lambda) a(x, y, \lambda) s(x, y) e(\lambda) d\lambda    (2) 

Assuming that \sum_{k=1}^{3} p_k(λ) = c_1, i.e. the combined response of the sensors is 
flat, and that the illuminant also has a broad flat spectrum which can be 
approximated by e(λ) = c_2, the intensity image acquired by the camera will be given 
by 

    I(x, y) = c_1 c_2 s(x, y) \int_{\lambda_1}^{\lambda_2} a(x, y, \lambda) d\lambda    (3) 

which can be written as 

    I(x, y) = L(x, y) A(x, y)    (4) 

where L(x, y) is the intensity of the incident light and A(x, y) is the reflectance 
property of the surface material. 

Under the above assumptions, each channel will produce output 

    I_k(x, y) = L(x, y) A_{p_k}(x, y)    (5) 

where A_{p_k}(x, y) = \int_{\lambda_1}^{\lambda_2} p_k(\lambda) a(x, y, \lambda) d\lambda is the reflectance property relative to 
the sensor spectral band. If the sensor spectral characteristic p_k(λ) is constant 
over the spectral band and zero elsewhere, A_{p_k}(x, y) = A_k(x, y) will represent 
the intrinsic property of the material. If p_k(λ) varies over the spectral band, the 
intrinsic albedo will be modulated by the sensor characteristic, i.e. the measured 
reflectance properties A_{p_k}(x, y) will be sensor specific. 

It should be noted that the output of each channel is a function of the scene 
geometry. If geometry is the dominant factor, as it will be for 3D objects, the 
three channels will be highly correlated. Any decision making scheme which 
attempts to fuse the three data sources will be affected by these correlations. 
Comparing formulas (4) and (5), we can suppress the effect of geometry by 
dividing (5) by (4), which leads to intensity normalised channels (normalised 
chroma channels r, g, b) 

    \frac{I_k(x, y)}{I(x, y)} = \frac{A_{p_k}(x, y)}{A(x, y)},  k = 1, 2, 3    (6) 

Thus the normalised channels measure sensor specific albedo (i.e. properties 
of the material) uncorrupted by object geometry. For an object with slowly 
changing albedo, the normalised channel will show it as of constant intensity, 
defining its shape in terms of its occluding boundary. In contrast to the 
normalised channels, the intensity image manifests the changes in surface shape 
and provides, therefore, complementary information about scene objects. 
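
The geometry cancellation can be illustrated with a toy image in which a single material is seen under varying shading. The function name `normalised_chroma`, the albedo values and the 2 x 2 shading map are assumptions made for this sketch.

```python
import numpy as np

def normalised_chroma(rgb, eps=1e-12):
    """Divide each channel by the intensity I = R + G + B (eps guards against black pixels)."""
    intensity = rgb.sum(axis=-1, keepdims=True)
    return rgb / np.maximum(intensity, eps), intensity[..., 0]

albedo = np.array([0.5, 0.3, 0.2])            # hypothetical per-channel reflectance
shading = np.array([[1.0, 0.6],
                    [0.3, 0.1]])              # hypothetical geometry factor per pixel
rgb = shading[..., None] * albedo             # each channel scales with the shading
chroma, intensity = normalised_chroma(rgb)
```

The chroma values are identical at every pixel, while the intensity image reproduces the shading, so the two carry complementary information.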

The three normalised channels sum up to 1, hence only two of them contain 
useful information. Moreover, the information content of a pair of channels is 
likely to be correlated. In order to emphasise differences between the channels, 
rather than using the channels themselves, it should be beneficial to use one of 
the normalised colour channels and create one difference channel, by subtracting 
two normalised channel images, i.e. 

    h_j(x, y) = I_j(x, y) - I_k(x, y)    (7) 

and shifting and rescaling to obtain non-negative pixel values. This corresponds 
to the idea of opponent colour spaces. The resulting representation should then 
maximise the complementary information content that should aid successful 
fusion. 

3 Face Verification Process 

The face verification process consists of three main stages: face image acquisition, 
feature extraction, and finally decision making. The first stage involves sensing 
and image preprocessing the result of which is a geometrically registered and 
photometrically normalised face image. Briefly, the output of a physical sensor 
(camera) is analysed by a face detector and once a face instance is detected, the 
position of the eyes is determined. This information allows the face part of the 
image to be extracted at a given aspect ratio and resampled to a pre-specified 
resolution. The extracted face image is finally photometrically normalised to 
compensate for illumination changes. 

The raw colour camera channel outputs, R(x, y), G(x, y) and B(x, y) are 
converted according to (2) and (6) into intensity image I(x, y), the normalised 
chroma channels r(x, y), g(x, y), b(x, y) and into opponent chromaticity channels 
by taking pairwise differences of the normalised chroma channels. 

    rg(x, y) = r(x, y) - g(x, y)    (8) 

    yb(x, y) = r(x, y) + g(x, y) - 2b(x, y)    (9) 

The chromaticity and opponent images are appropriately shifted and rescaled. 
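
Eqs. (8) and (9) followed by a shift and rescale can be sketched as below. The text does not say how the images are shifted and rescaled, so the min-max mapping to [0, 1] used here is an assumption, as are the function names and the toy pixel values.

```python
import numpy as np

def opponent_channels(r, g, b):
    """Opponent chromaticity channels of Eqs. (8) and (9)."""
    rg = r - g
    yb = r + g - 2.0 * b
    return rg, yb

def rescale01(img):
    """Shift and rescale to [0, 1]; a constant image maps to zeros (assumed convention)."""
    lo, hi = img.min(), img.max()
    return np.zeros_like(img) if hi == lo else (img - lo) / (hi - lo)

r = np.array([0.5, 0.2])
g = np.array([0.3, 0.3])
b = np.array([0.2, 0.5])
rg, yb = opponent_channels(r, g, b)
rg01, yb01 = rescale01(rg), rescale01(yb)
```

The rescaled channels are non-negative, as required before they are passed to the feature extraction stage.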

In the second stage of the face verification process the face image data is 
projected into a feature space. We opted for the Linear Discriminant Analysis 
(LDA) feature space. The final stage of the face verification process involves 
matching and decision making. Basically the features extracted for a face image 
to be verified, x, are compared with a stored template, that was acquired on 
enrolment, μ_i. The comparison is carried out using the isotropic gradient direction 
metric, s, which was shown to outperform the Euclidean metric and normalised 
correlation function [9]. It is defined as 

    s = \frac{\| (x - \mu_i)^T \nabla_I P(i|x) \|}{\| \nabla_I P(i|x) \|}, 
    \nabla_I P(i|x) = \sum_{j \neq i} p(x|j) (\mu_j - \mu_i)    (10) 

where ∇_I P(i|x) denotes the gradient direction assuming an isotropic structure 
for the covariance matrix of the clients distribution, p(x|j). 
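
A small numerical sketch of the gradient direction score follows. The paper only states that the client distributions have isotropic covariance, so the unnormalised Gaussian form of p(x|j) with a common width `sigma`, the function name and the two-client example are assumptions.

```python
import numpy as np

def gradient_direction_score(x, mus, i, sigma=1.0):
    """Distance between x and the claimed mean mus[i], measured along the gradient direction.

    Assumes isotropic Gaussian densities p(x|j) with common width sigma, up to a
    constant factor that cancels in the ratio. If every p(x|j) underflows to zero
    the gradient vanishes and the score is undefined.
    """
    diffs = mus - x
    p = np.exp(-0.5 * np.sum(diffs ** 2, axis=1) / sigma ** 2)
    mask = np.arange(len(mus)) != i
    grad = (p[mask, None] * (mus[mask] - mus[i])).sum(axis=0)
    return float(abs((x - mus[i]) @ grad) / np.linalg.norm(grad))

mus = np.array([[0.0, 0.0],
                [4.0, 0.0]])                  # hypothetical client means in a 2D feature space
s_at_template = gradient_direction_score(np.array([0.0, 0.0]), mus, i=0)
s_towards_other = gradient_direction_score(np.array([1.0, 0.0]), mus, i=0)
s_orthogonal = gradient_direction_score(np.array([0.0, 1.0]), mus, i=0)
```

Only displacement towards the other client increases the score; displacement orthogonal to the gradient direction leaves it at zero.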

The score, output by the matching process, s, which measures the degree 
of similarity between the test image and the template, is then compared to 
a threshold, η, in order to decide whether the claim is genuine (class ω_a) or 
impostor (class ω_b), i.e. 

    s(x) \underset{\omega_b}{\overset{\omega_a}{\lessgtr}} \eta    (11) 

where x is the extracted LDA feature vector. If this final stage of processing 
is applied to each data channel separately, we end up with a number of scores, 
s_k = s(x_k), k = 1, ..., N, which then have to be fused to obtain the final 
decision. Any combination of the results of processing at this level is referred to 
as decision level fusion. In fact, the problem now is to find a function f so that 
the decision rule 

    f(s_1, s_2, ..., s_N) \underset{\omega_b}{\overset{\omega_a}{\lessgtr}} \eta    (12) 

leads to a higher verification performance. This kind of fusion is also known 
as the confidence level or soft fusion [7]. Since the adopted experts in our case 
deliver a similar level of accuracy, their combination should either attach the 
same weight to all the scores or have a mechanism for selecting the best score. 
Thus in this study, two simple combining strategies have been considered. In the 
first method, we use the average of the scores as the final score, 

    f(s_1, s_2, ..., s_N) = \frac{1}{N} [s_1 + s_2 + ... + s_N]    (13) 

The second method is to select the best score as the final score, i.e. 

    f(s_1, s_2, ..., s_N) = \min(s_1, s_2, ..., s_N)    (14) 
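
The two combining strategies and the threshold test can be sketched as follows. The function names, the example scores and the threshold value are hypothetical, and the decision direction (accept when the fused score does not exceed the threshold) is an assumption based on the minimum being described as the best score.

```python
import numpy as np

def fuse_average(scores):
    """Eq. (13): average of the channel scores."""
    return float(np.mean(scores))

def fuse_min(scores):
    """Eq. (14): smallest (best) channel score."""
    return float(np.min(scores))

def accept(fused, eta):
    """Accept the identity claim when the fused score does not exceed the threshold (assumed direction)."""
    return fused <= eta

scores = np.array([0.2, 0.5, 0.8])            # hypothetical scores from three channel experts
eta = 0.4                                     # hypothetical threshold
avg_score, min_score = fuse_average(scores), fuse_min(scores)
avg_decision, min_decision = accept(avg_score, eta), accept(min_score, eta)
```

With these toy values the two rules disagree, which shows that the threshold has to be set separately for each fusion rule.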



4 Experimental Design 

The aim of the experiments is to show that by decorrelating the sensory data 
used by component experts, the performance of the multiple classifier system 
improves considerably. We use the BANCA database and its associated experimental 
protocols for this purpose. 

4.1 BANCA Database 

The BANCA database contains 52 subjects (26 males and 26 females). Each 
subject participated in 12 recording sessions in different conditions and with 
different cameras. Sessions 1-4 contain data under Controlled conditions while 
sessions 5-8 and 9-12 contain Degraded and Adverse scenarios respectively. Each 
session contains two recordings per subject, a true client access and an informed 
impostor attack. For the face image database, 5 frontal face images have been 
extracted from each video recording, giving 5 client images and 5 impostor 
images per subject in each session. In order to create more independent experiments, 
images in each session have been divided into two groups of 26 subjects, 13 males 
and 13 females. Fig. 1 shows a few examples of the face data. 

In the BANCA protocol, 7 distinct experimental configurations have 
been specified, namely Matched Controlled (MC), Matched Degraded (MD), 
Matched Adverse (MA), Unmatched Degraded (UD), Unmatched Adverse (UA), 
Pooled test (P) and Grand test (G). Table 1 describes the usage of the different 
sessions in each configuration. “T” refers to the client training while “C” and 
“I” depict client and impostor test sessions respectively. The decision function 
can be trained using only 5 client images per person from the same group and 
all client images from the other group. More details about the database and 
experimental protocols can be found in [1]. 




Fig. 1. Examples of the database 
images, a: Controlled, b: Degraded 
and c: Adverse scenarios. 



Table 1. The usage of the different sessions 
in the BANCA experimental protocols. 

4.2 Experimental Setup 

The original resolution of the image data is 720 x 576. The experiments were 
performed with relatively low resolution face images, namely 64 x 49. The 
results reported in this article have been obtained by applying a geometric face 
registration based on manually annotated eye positions. Histogram equalisation 
was used to normalise the registered face photometrically. 

The feature selection process is performed using Linear Discriminant Analysis 
(LDA). The XM2VTS database [4] was used for calculating the LDA projection 
matrix. In this study, the isotropic Gradient Direction metric [9] (GD) 
was used as the scoring function. The thresholds in the decision making system 
were determined based on the Equal Error Rate criterion, i.e. where the false 
rejection rate (FRR) is equal to the false acceptance rate (FAR). The thresholds 
were set either globally (GT) or using the client specific thresholding (CST) 
technique [3]. As only 5 client images per person are available in the training 
sessions of the BANCA database, in the case of the global thresholding method all 
these images are used for training the client's template. The other group data 
is then used to set the threshold. With the client specific thresholding 
strategy, only two images are used for the template training and the other three, 
along with the other group data, are used to determine the thresholds. Moreover, 
in order to increase the amount of data used for training and to take the errors 
of the geometric normalisation into account, 24 additional face images for each 
image were generated by perturbing the location of the eye positions around the 
annotated positions. 

Table 2. ID verification results using the GD metric in the R, G and B colour spaces. 
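
The Equal Error Rate criterion used for threshold selection can be sketched as follows. This is an illustrative sketch only: it assumes distance-like scores (a claim is accepted when the score does not exceed the threshold) and searches the observed score values for the threshold at which FAR and FRR are closest; the function name is hypothetical.

```python
def eer_threshold(client_scores, impostor_scores):
    """Return the observed score value at which |FAR - FRR| is smallest."""
    best_t, best_gap = None, float("inf")
    for t in sorted(set(client_scores) | set(impostor_scores)):
        # A client is falsely rejected when its score exceeds the threshold.
        frr = sum(s > t for s in client_scores) / len(client_scores)
        # An impostor is falsely accepted when its score is at or below it.
        far = sum(s <= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```

A global threshold would be obtained by calling this once on the pooled evaluation scores, whereas a client specific threshold would call it once per client on that client's own scores.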

5 Experimental Results 

Table 2 shows the performance of the face verification system considering the 
individual R, G, B colour channels. Table 3 contains the corresponding results 
using the intensity (i), chromaticity (r, g, b) and opponent chromaticity (rg, yb) 
spaces individually. In these tables, based on our previous study reported in 
[8], global (GT) and client specific (CST) thresholding techniques were used for 
unmatched (Ud, Ua, P) and matched (Mc, Md, Ma, G) protocols respectively. 

Table 3. ID verification results on BANCA configurations using the GD metric in the 
intensity (i), chromaticity (r, g, b) and opponent chromaticity (rg, yb) spaces. 

The values in the table indicate the FAR, FRR and Total Error Rates (TER), 
i.e. the sum of false rejection and false acceptance rates. In general, as one would 
expect, for matched protocols the performance is better than for the unmatched 
protocols due to the generalisation problem posed by the latter. There appears to 
be no outright winner among the considered image spaces. A comparison of the 
results in different spaces shows that although in some cases the chromaticity 
or opponent spaces are superior, the most consistent results are obtained in the 
intensity space. The main exception here is the Ua protocol where significantly 
better performance is obtained in the normalised green space (g). Moreover, 
among the chromaticity spaces, the (g) space gives the best performance overall. 
These results also demonstrate that among the opponent chromaticity spaces, 
the rg space leads to better results. Therefore, for the fusion study, we compare 
the original R, G, B spaces with the intensity (i), normalised green (g) and 
normalised red-green (rg) spaces. 

To investigate the effect of combining different colour channel classifiers, we 
adopted the two simple methods of fusion, the average (Decision 1) and the best 
(Decision 2) rules, discussed in Section 3. The associated results are shown in 
Table 4. In this table, Γ1 and Γ2 refer to combining the i, g, rg and the R, G, B spaces 
respectively. 

Table 4. Decision level fusion results using the averaging and the best rules. Γ1: i, g, rg, 
Γ2: R, G, B channels. 

These results firstly show that a better performance is obtained by combining 
the different channel classifiers. Moreover, they clearly demonstrate that in all 
cases except the Md protocol a significantly better verification rate is achieved 
by fusing the intensity, chromaticity and opponent chromaticity space classifiers 
(Γ1). This is due to the enhanced uncorrelatedness of the i, g, rg spaces. The poor 
quality of the Γ1 system for the Md protocol is due to the very weak verification 
performance of the g channel in this scenario. Finally, one can see that between 
the adopted fusion methods, score averaging is superior. 

6 Conclusions 

We have investigated the benefits of decorrelating R, G, B colour camera channels 
by separating the effect of object shape and surface material properties. The 
study was carried out in the context of face verification. The BANCA database 
and the associated protocols were used for performance characterisation. 

Individually, the component face verification experts in the transformed 
colour space (intensity, chromaticity and opponent chromaticity channels) were 
not necessarily better than those using the original R, G, B images. Overall, the 
performance in the intensity space was more consistent than that of any other 
channel, but not sufficiently better to justify using a gray level rather than colour 
camera. However, the advantage of using the physics based methods of decorrelating 
the sensory data became apparent in the context of multiple expert fusion. 
Using the i, g and rg component experts in a multiple classifier system improved 
the performance over the R, G, B colour channel expert fusion significantly. Taking 
the intensity space as the baseline, in the case of the controlled scenario (Mc) 
an improvement by a factor of three was obtained. Similar gains were achieved 
for the G protocol. 



Abstract. Recent work on perceptron-based fusion of multiple fingerprint 
matchers showed the effectiveness of such an approach in improving the performance 
of personal identity verification systems. However, to the best of our 
knowledge, no previous work has investigated this fusion approach when stringent 
requirements in terms of verification errors are given and the number of available 
samples for perceptron training is small. Such an investigation can help to 
understand for which kinds of applications this fusion rule can be useful. 
Reported experiments, based on two benchmark data sets, show that perceptron-based 
fusion can be useful for high security fingerprint verification applications, 
and that it is effective in realistic small-sample-size cases. 



1 Introduction 

Fingerprints are widely used to automatically grant or deny access to restricted 
areas (personal identity authentication or verification) [1]. The person to be authenticated 
submits her/his fingerprint to the system and declares her/his identity. The system 
matches the given fingerprint with the one stored in its data base and associated 
with the claimed identity. A matching score, i.e., a degree of similarity between the two 
fingerprints, is computed. The higher the score, the higher the degree of similarity. 
The final decision about the claimed identity is made by a threshold-based approach. 
In particular, if the score is higher than the given acceptance threshold, the 
claimed identity is accepted, and the person is classified as a genuine user. Otherwise, 
she/he is rejected, and classified as an impostor. 

The so-called verification errors strictly depend on such threshold. In particular, 
the rate of impostors accepted by the system, also called false acceptance rate (FAR), 
and the rate of genuine users rejected by the system, also called false rejection rate 
(FRR), are the two performance evaluation parameters by which the effectiveness of 
the acceptance threshold is checked. 

It is worth noting that high security applications, such as access control to nuclear 
power stations, require the FRR (or FAR) to be as low as possible for a 
fixed, and usually also very low, value of the FAR (or FRR). As an example, the FAR value could 
be fixed to 1%. The FRR value should then be minimized while keeping FAR=1%. However, 
it is very difficult to design a fingerprint matcher able to meet such a stringent requirement. 
Usually, the resulting FAR or FRR value, obtained by keeping the other parameter fixed 
to 1%, is unacceptable for the given application. 

In order to improve the performance of automatic fingerprint verification systems 
[2], the decision-level fusion of multiple fingerprint matchers has been recently proposed, 
and this research topic is still very active [3-7]. Among the other approaches, 
decision-level fusion by a perceptron neural net has been investigated [4-6]. In particular, 
it has been shown that this kind of fusion can outperform the best individual 
matcher, and also some fixed fusion rules [5-6]. However, verification performance 
when only a small data set is available for perceptron training has not yet been investigated 
in detail. It is worth noting that the small sample size problem is a realistic condition 
for any automatic verification system. This becomes a crucial issue under high 
security requirements. 

In this paper, we experimentally investigated the performance improvement achie- 
vable by perceptron-based fusion of two well-known fingerprint matchers, when 
stringent requirements in terms of FAR and FRR are used, and the number of avail- 
able training samples decreases. Results on two widely used benchmark data sets, 
namely, the FVC2000-DB1 and the FVC2000-DB2 data sets [2], are reported. 

The paper is organized as follows. Section 2 describes the selected individual fin- 
gerprint matchers. Section 3 describes our fusion approach, by focusing on the per- 
ceptron-based fusion rules. Section 4 reports experimental results. Section 5 con- 
cludes the paper. 

2 The Selected Individual Algorithms for Fingerprint Verification 

A fingerprint verification algorithm, also called “matcher”, performs the comparison 
between the fingerprint submitted by the person to be authenticated, and the one 
stored in its data base (the so-called “template”). The comparison is performed by the 
matching algorithm, which provides a degree of similarity between the fingerprints. This 
similarity degree is a real value in [0,1], named “score”. 

The literature reports two main kinds of approaches to fingerprint verification: the 
minutiae-based ones and the filterbank-based ones [3, 8, 9]. The former try to exploit 
the local features of fingerprints, while the latter focus on global 
features. As results achieved in other application fields support the hypothesis that 
combining information coming from different algorithms can substantially improve 
fingerprint verification performance, we selected one algorithm of each kind. For a 
review of other fingerprint verification approaches, the reader is referred to Ref. 1. 

The minutiae-based matchers are based on the location and orientation of the so- 
called minutiae points. The minutiae-points are the terminations and the bifurcations 
of the ridge lines (Figure 1). The orientation of a minutia point is defined as the local 
orientation of the ridge which the minutia belongs to. 



Fig. 1. The so-called minutiae-points: bifurcations and terminations of the ridge lines. 

The so-called “String” algorithm is a widely used minutiae-based approach [8]. In 
the following, we summarize this algorithm. Let X be the template minutiae set. Let 
Y be the input minutiae set. For each minutia x ∈ X, the following algorithm is performed. 
For each y ∈ Y, x is aligned to y. Such alignment implies the roto-translation 
of the input minutiae-set until x and y match perfectly. Let 

$$ A(x, y) = \{ (x_i, y_i),\ x_i \in X,\ y_i \in Y : aligned(x_i, y_i) = true \} $$

be the set of other couples of aligned minutiae. $x_i$ and $y_i$ are considered as aligned 
on the basis of a pre-defined “minutiae distance” not exceeding a certain fixed threshold. 
At the end of such loops, the value $\max_{x,y} \{ |A(x, y)| \}$ is converted into a matching 
score by the formula: 

$$ score = \frac{\left( \max_{x,y} \{ |A(x, y)| \} \right)^2}{|X| \cdot |Y|} $$
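
The loop structure described above can be sketched in a deliberately simplified form. The sketch below is not the “String” algorithm of [8]: it assumes minutiae are plain 2-D points, aligns by translation only (no rotation, no minutia orientation), uses the Euclidean distance with a tolerance `tol` as the “minutiae distance”, pairs points greedily, and counts the anchoring couple among the aligned ones. All names are hypothetical.

```python
import math

def count_aligned(template, probe, tol):
    """Greedy one-to-one count of probe points lying within tol of a template point."""
    used = set()
    count = 0
    for px, py in probe:
        for k, (tx, ty) in enumerate(template):
            if k not in used and math.hypot(px - tx, py - ty) <= tol:
                used.add(k)
                count += 1
                break
    return count

def string_like_score(template, probe, tol=1.0):
    """Score in [0, 1]: (max number of aligned couples)^2 / (|X| * |Y|)."""
    if not template or not probe:
        return 0.0
    best = 0
    for tx, ty in template:          # each template minutia x
        for qx, qy in probe:         # each input minutia y
            dx, dy = tx - qx, ty - qy
            # Translate the whole input set so that y falls exactly on x.
            shifted = [(x + dx, y + dy) for x, y in probe]
            best = max(best, count_aligned(template, shifted, tol))
    return best ** 2 / (len(template) * len(probe))
```

Because the number of aligned couples can never exceed the size of the smaller set, the squared count divided by the product of the two set sizes stays within [0, 1], in agreement with the score range stated earlier.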



The filterbank-based matchers try to describe the fingerprint texture by processing 
the fingerprint image with a bank of filters aimed at enhancing the different orientations 
of the ridge flow (Figure 2). 




Fig. 2. (a) Original fingerprint image, (b) Filtered fingerprint images enhancing three different 
ridge lines orientations by three Gabor filters. 



In particular, we used the so-called “Filter” algorithm [9], which is based on a set 
of Gabor filters. A Gabor filter is a band-pass filter with orientation-selective characteristics. 
It has been shown that such filters are very effective in enhancing the fingerprint 
texture [10]. In [9], a tessellation of the fingerprint image is first defined. The 
tessellation consists of a circle subdivided into a certain number of sectors. Such 
circle is centered on the fingerprint “core”¹ (Figure 3). 





Fig. 3. The main steps of the fingercode feature vector generation. A circle decomposed into a 
certain number of sectors is centered on the core of the fingerprint image. A set of Gabor filters 
is applied to such tessellated image. The standard deviation of the gray level values of each 
sector of the filtered image is computed, so generating the fingercode feature vector [9]. 

¹ The fingerprint core is the point around which many ridge lines converge (Figure 3). 

A set of Gabor filters is applied to such tessellation, so producing a set of filtered 
and tessellated images. The standard deviation of the grey values in each sector of 
such images is computed. The obtained values constitute a feature vector named 
“fingercode” [9]. With this approach, both template and input fingerprints are represented 
by a fingercode. So, the matching phase can be performed by simply computing 
the Euclidean distance between such fingercodes. The obtained distance is finally 
converted into a matching score. 
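
The last steps of this pipeline can be sketched as follows, assuming the Gabor filtering and tessellation have already produced, for each filtered image, a list of sectors with their grey values. The use of the population standard deviation and the distance-to-score mapping 1/(1 + d) are assumptions made for illustration; the conversion actually used in [9] is not given here. All function names are hypothetical.

```python
import math
import statistics

def fingercode(filtered_images):
    """Build a fingercode from filtered, tessellated images.

    filtered_images: list of images, each a list of sectors,
    each sector a list of grey values.
    """
    return [statistics.pstdev(sector)
            for image in filtered_images
            for sector in image]

def fingercode_distance(code_a, code_b):
    """Euclidean distance between two fingercodes of equal length."""
    return math.dist(code_a, code_b)

def distance_to_score(distance):
    """Hypothetical monotone mapping of a distance to a score in (0, 1]."""
    return 1.0 / (1.0 + distance)
```

Any decreasing mapping of the distance would serve the same purpose, as long as identical fingercodes receive the highest score.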



3 Decision-Level Fusion of the Selected Fingerprint Matchers 

The proposed fusion scheme is outlined in Figure 4. 




Fig. 4. The proposed decision-level fusion scheme of the two selected fingerprint matchers. 

Let $s_m$ and $s_f$ be the matching scores provided by the minutiae-based and the 
filterbank-based matching algorithms, respectively: 

- Apply the following transformation to the above scores $s_m$ and $s_f$ to implement the 
fusion: 

$$ s = f(s_m, s_f) \qquad (2) $$

- Compare the obtained score value s with a threshold. The claimed identity is classified 
as “genuine user” if: 

$$ s > threshold \qquad (3) $$

otherwise it is classified as “impostor”. 

It is easy to see that the above methodology can also be used for the case of more 
than two matchers. 

Among the fusion rules which can implement eq. (2), the simplest ones are the 
so-called fixed fusion rules [11]. This term derives from the observation that such 
rules require no training parameters. In other words, eq. (2) can be implemented by 
simply combining the given matching scores without any other kind of information. 
Among fixed rules, we selected the mean rule: 

$$ s = \frac{s_m + s_f}{2} \qquad (4) $$

and the product rule: 

$$ s = s_m \cdot s_f \qquad (5) $$

The perceptron-based fusion rules belong to the class of the so-called trained rules. 
In fact, eq. (2) is implemented by the logistic transformation: 

$$ s = \frac{1}{1 + \exp[-(w_0 + w_1 s_m + w_2 s_f)]} \qquad (6) $$

The logistic transformation parameters ($w_0$, $w_1$, $w_2$), also called weights, are usually 
computed by a gradient descent algorithm with a cross-entropy loss function, or with 
a sum square error loss function [12]. 
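
The three fusion rules and the threshold test can be sketched as below, writing `s_m` and `s_f` for the scores of the minutiae-based and filterbank-based matchers. This is a minimal illustration: the weights are simply passed in, and no training procedure is shown. Function names are hypothetical.

```python
import math

def mean_rule(s_m, s_f):
    """Fixed mean rule, eq. (4)."""
    return (s_m + s_f) / 2.0

def product_rule(s_m, s_f):
    """Fixed product rule, eq. (5)."""
    return s_m * s_f

def logistic_rule(s_m, s_f, weights):
    """Perceptron (logistic) fusion, eq. (6), with weights (w0, w1, w2)."""
    w0, w1, w2 = weights
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * s_m + w2 * s_f)))

def is_genuine(fused_score, threshold):
    """Decision of eq. (3): accept when the fused score exceeds the threshold."""
    return fused_score > threshold
```

The fixed rules treat both matchers symmetrically, whereas the logistic rule can learn to down-weight the weaker matcher through w1 and w2.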

Recently, Marcialis and Roli [6] proposed a novel approach to address the weights 
optimisation problem by using the following loss function: 

$$ FD = \frac{(\mu_{gen} - \mu_{imp})^2}{\sigma_{gen}^2 + \sigma_{imp}^2} \qquad (7) $$

The above loss function is called Fisher distance. In eq. (7), $\mu_{gen}$, $\mu_{imp}$ and 
$\sigma_{gen}^2$, $\sigma_{imp}^2$ are the means and the variances defined as follows: 

$$ \mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} s_j^{(i)}, \qquad \sigma_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} \left( s_j^{(i)} - \mu_i \right)^2, \qquad i \in \{gen, imp\} \qquad (8) $$

$s_j^{(i)}$ being the j-th output of eq. (6) for class i, and $n_i$ the number of such outputs. 
Our learning algorithm looks for 
weights of the logistic transform that maximise the Fisher distance by a gradient-descent 
algorithm. Accordingly, the distributions of genuine and impostor classes will be 
separated in terms of distance between their means, while their variances will be reduced. 
The rationale behind such optimisation approach can be summarised as follows. 
If genuine and impostor classes are well separated, errors that usually affect 
estimates of class distributions have a smaller impact on threshold selection, and consequently 
on system performance. Further details and experiments showing the 
benefits of this fusion approach can be found in [6]. 
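
The Fisher distance of eq. (7) can be evaluated for a given set of fused scores as in the sketch below. It only computes the criterion; the weight search itself is not shown. The variance is normalised by the number of samples, which is an assumption, and the function name is hypothetical.

```python
def fisher_distance(gen_scores, imp_scores):
    """Squared difference of class means over the sum of class variances."""
    def mean(values):
        return sum(values) / len(values)

    def variance(values, mu):
        # Normalised by n (assumption); n - 1 would be the unbiased variant.
        return sum((v - mu) ** 2 for v in values) / len(values)

    mu_gen, mu_imp = mean(gen_scores), mean(imp_scores)
    denom = variance(gen_scores, mu_gen) + variance(imp_scores, mu_imp)
    if denom == 0.0:
        return float("inf")  # both classes collapse to single points
    return (mu_gen - mu_imp) ** 2 / denom
```

A weight vector that pushes the genuine outputs of eq. (6) towards 1 and the impostor outputs towards 0, while keeping each group tight, increases this value.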



4 Experimental Results 

4.1 The Data Sets 

We used the FVC2000-DB1 and DB2 data sets, proposed for the Fingerprint Verifica- 
tion Competition [2]. Each data set is made up of 800 fingerprint images acquired by 
a different fingerprint image capture device. The number of identities is 100, and the 
number of images per identity is eight. 

The capture device used for collecting the DB1 data set is an optical sensor. The 
optical sensor is characterised by a LED light source and a CCD placed on the side of 
a glass platen, on which the fingerprint to be acquired is placed. The LED illuminates the 
fingerprint and the CCD captures the light reflected from the glass, enhancing ridges 
and valleys of the given fingerprint. The image size is 300x300 pixels at 500 dpi. 

The capture device used for the DB2 data set is a capacitive sensor. The core of a 
capacitive sensor is the sensing surface, which is made up of a two-dimensional 
array of silicon capacitor plates. The finger skin acts as the second plate of each 
capacitor. The capacitance depends on the distance between the finger skin, i.e. ridges 
and valleys, and the plates. The captured image is derived from the capacitance measured 
at each array element. The image size is 256x364 pixels at 500 dpi. 

Figures 5 and 6 show examples of fingerprint images from the DB1 and DB2 data 
sets. 




Fig. 6. Examples of fingerprint images from the FVC2000-DB2 data set. 



4.2 Experimental Protocol 

Our experimental protocol is very similar to the one proposed in the FVC competi- 
tion [2]: 

• For each matcher, or any combination of the two selected matching algorithms, we 
computed two sets of scores. The first one is the “genuine matching scores” set G, 
made up of all comparisons among fingerprints of the same identity (in our experiments, 
all images were compared with all the other images of a given identity). 
The second set is the “impostor matching scores” set I, made up of all comparisons 
among fingerprints of different identities. 

• We randomly subdivided the above sets into four parts, so that G = G1 ∪ G2 and I = I1 ∪ I2. 
G1 and G2, as well as I1 and I2, are disjoint sets. 

• The training set Tr = {G1, I1} was used to compute the weights of the perceptron-based 
fusion rules. 

• The test set Tx = {G2, I2} was used to assess the performance of the individual and 
combined algorithms on unknown patterns. 

• We performed our experiments on five different training-test couples (i.e. on five 
different partitions of I and G): {Tr1, Tx1}, ..., {Tr5, Tx5}, and averaged the results. 

Let |Tr| and |Tx| be the size of the Tr and Tx sets, respectively. In order to simulate 
the effect of a decreasing training set size, the perceptron has been trained and tested in 
three experimental conditions: 

(1) |Tr|/|Tx| = 1, that is, the Tr and Tx sets have the same size; 

(2) |Tr|/|Tx| = 1/2, so that the Tx set size is twice as big as that of Tr; 

(3) |Tr|/|Tx| = 1/3, so that the Tx set size is three times as big as that of Tr. 

In all cases, Tr ∪ Tx = I ∪ G. Individual algorithms and fixed fusion rules have also 
been tested according to the above protocol. 
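
The random partition into training and test parts with a prescribed size ratio can be sketched as follows. The function name and the use of a seeded random generator are illustrative assumptions; the same routine would be applied separately to the genuine set G and to the impostor set I, and repeated with five different seeds to obtain the five training-test couples.

```python
import random

def split_scores(scores, ratio, rng):
    """Randomly split scores into (train, test) with len(train)/len(test) close to ratio."""
    shuffled = list(scores)
    rng.shuffle(shuffled)
    # |Tr| / |Tx| = ratio and |Tr| + |Tx| = n give |Tr| = n * ratio / (1 + ratio).
    n_train = round(len(shuffled) * ratio / (1.0 + ratio))
    return shuffled[:n_train], shuffled[n_train:]
```

For instance, with 120 scores the three conditions |Tr|/|Tx| = 1, 1/2 and 1/3 give training parts of 60, 40 and 30 scores respectively.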

The high security fingerprint verification performances of the individual finger- 
print matchers, and the different combinations of the two selected algorithms, were 
assessed and compared in terms of the following parameters: 

(a) 1%FAR, which is the rate of rejected genuine users (false rejection rate, FRR) 
when the rate of accepted impostors (false acceptance rate, FAR) is fixed to 1%; 

(b) 1%FRR, which is the FAR when the FRR is fixed to 1%. 
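
The two operating points can be estimated from sets of genuine and impostor scores as in the sketch below. It is an illustration under stated assumptions: a claim is accepted when its score exceeds the threshold, candidate thresholds are the observed score values, and since an error rate of exactly 1% may not be attainable on a finite sample, the sketch takes the best value reached while the fixed rate does not exceed the target. The function name is hypothetical.

```python
def operating_points(genuine, impostor, target=0.01):
    """Return (FRR at FAR <= target, FAR at FRR <= target)."""
    thresholds = [float("-inf")] + sorted(set(genuine) | set(impostor))
    frr_at_far, far_at_frr = 1.0, 1.0
    for t in thresholds:
        # Impostors scoring above the threshold are falsely accepted.
        far = sum(s > t for s in impostor) / len(impostor)
        # Genuine users scoring at or below the threshold are falsely rejected.
        frr = sum(s <= t for s in genuine) / len(genuine)
        if far <= target:
            frr_at_far = min(frr_at_far, frr)
        if frr <= target:
            far_at_frr = min(far_at_frr, far)
    return frr_at_far, far_at_frr
```

In the protocol described above, the thresholds would be evaluated on the test set Tx for each of the five partitions and the resulting rates averaged.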



4.3 Results 

Tables 1-2 show the average 1%FAR and 1%FRR for the individual algorithms (sec- 
ond and third rows), the fixed fusion rules (fourth and fifth rows) and the perceptron- 
based fusion rules on the FVC2000-DB1 data set. Each column is related to the size 
ratio between training set and test set. 

It is worth noting that the best individual algorithm and the fixed fusion rules perform 
worse as the training set size decreases (second, third, fourth columns). In particular, 
fixed fusion rules perform progressively worse than the best individual algorithm. 
On the contrary, the perceptron-based fusion yields a performance 
definitely better than that of the best individual matcher, even when the training set/test 
set size ratio decreases. As an example, the perceptron trained with the cross-entropy loss 
function exhibits the most notable trade-off between 1%FAR and 1%FRR. A performance 
improvement of 2% over the best individual algorithm is observed for 
|Tr|/|Tx| = 1/3 (Tables 1-2). 

Table 1. Average 1%FAR percentage values on the FVC2000-DB1 test set for decreasing 
values of |Tr|/|Tx|. In the first column, “Perceptron-CE” refers to the use of a cross-entropy loss 
function, “Perceptron-SSE” refers to the use of a sum square error loss function, “Perceptron-FD” 
refers to the use of the Fisher distance loss function. 

1%FAR                       |Tr|/|Tx| = 1    |Tr|/|Tx| = 1/2    |Tr|/|Tx| = 1/3
String                           2.86              3.21               3.22
Filter                          16.89             17.25              17.35
Fusion by Mean                   3.69              3.97               4.13
Fusion by Product                1.78              3.97               4.13
Fusion by Perceptron-CE          1.72              1.85               1.70
Fusion by Perceptron-SSE         1.69              1.88               1.76
Fusion by Perceptron-FD          1.72              1.86               2.15

Table 2. Average 1%FRR percentage values on the FVC2000-DB1 test set for decreasing 
values of |Tr|/|Tx|. In the first column, “Perceptron-CE” refers to the use of a cross-entropy loss 
function, “Perceptron-SSE” refers to the use of a sum square error loss function, “Perceptron-FD” 
refers to the use of the Fisher distance loss function. 

1%FRR                       |Tr|/|Tx| = 1    |Tr|/|Tx| = 1/2    |Tr|/|Tx| = 1/3
String                           3.76              5.01               5.09
Filter                          45.56             46.56              43.87
Fusion by Mean                  17.54             16.85              13.54
Fusion by Product                2.38             16.85              13.54
Fusion by Perceptron-CE          3.65              3.80               2.91
Fusion by Perceptron-SSE         4.74              4.07               2.65
Fusion by Perceptron-FD          3.06              3.05               3.13



Table 3. Average 1%FAR percentage values on the FVC2000-DB2 test set for decreasing 
values of |Tr|/|Tx|. In the first column, “Perceptron-CE” refers to the use of a cross-entropy loss 
function, “Perceptron-SSE” refers to the use of a sum square error loss function, “Perceptron-FD” 
refers to the use of the Fisher distance loss function. 

1%FAR                       |Tr|/|Tx| = 1    |Tr|/|Tx| = 1/2    |Tr|/|Tx| = 1/3
String                           4.42              5.10               5.37
Filter                          35.76             37.83              37.47
Fusion by Mean                  14.45             14.85              15.30
Fusion by Product                4.27             14.85              15.30
Fusion by Perceptron-CE          4.09              4.36               4.54
Fusion by Perceptron-SSE         4.06              4.36               4.62
Fusion by Perceptron-FD          4.12              4.38               4.60



Table 4. Average 1%FRR percentage values on the FVC2000-DB2 test set for decreasing 
values of |Tr|/|Tx|. In the first column, “Perceptron-CE” refers to the use of a cross-entropy loss 
function, “Perceptron-SSE” refers to the use of a sum square error loss function, “Perceptron-FD” 
refers to the use of the Fisher distance loss function. 

1%FRR                       |Tr|/|Tx| = 1    |Tr|/|Tx| = 1/2    |Tr|/|Tx| = 1/3
String                          31.70             43.90              44.04
Filter                          88.73             87.40              84.72
Fusion by Mean                  60.47             59.08              56.46
Fusion by Product               18.89             59.08              56.46
Fusion by Perceptron-CE         18.16             20.62              21.57
Fusion by Perceptron-SSE        18.17             20.18              22.85
Fusion by Perceptron-FD         19.24             19.83              27.91



Tables 3-4 show the average 1%FAR and 1%FRR for the individual algorithms 
(second and third rows), the fixed fusion rules (fourth and fifth rows) and the 
perceptron-based fusion rules on the FVC2000-DB2 data set. Each column is related to the 
size ratio between training set and test set. 

Even in this case, the performance improvement obtained by perceptron-based 
fusion rules is notable with respect to that of the fixed rules, which are not able to 
outperform the best individual matcher when the training set size decreases (third and 
fourth columns). In particular, perceptron-based fusion achieved a 1%FRR 
improvement of 20% over the best individual matcher (Table 4). Such results are 
particularly relevant because the 1%FRR assesses the verification performance when 
only 1% of genuine users can be wrongly rejected. 

It is worth noting that the small sample size issue should advise against the use of 
more complex, trained fusion rules for performance improvement, as remarked in 
[3]. Therefore, three main observations can be made on the basis of this remark and 
the reported results: 

- the individual algorithms do not exhibit satisfactory performance under high security 
requirements (see in particular Table 4, second and third rows); 

- the use of fixed fusion rules can be inappropriate for improving performance, 
especially under the increase of the user population, that is, the reduction of the 
training set size w.r.t. the test set size (Tables 1-4, fourth and fifth columns); 

- on the other hand, the perceptron-based fusion appears to exhibit the best trade-off 
among high security requirements, the increase of the user population, and the 
available sample size for weights optimisation. In addition, a perceptron-based 
fusion approach is very simple, so that the overall system complexity does not 
significantly increase. 



5 Conclusions 

In this paper, we experimentally investigated the performance improvement achievable 
by perceptron-based fusion of two well-known fingerprint matchers when stringent 
requirements in terms of FAR and FRR are imposed and the available training 
set is small. Experiments with two widely used benchmark data sets have been performed. 
Reported results showed that the perceptron-based fusion rules can outperform 
the best individual matcher and some fixed fusion rules. This result is notable 
because realistic conditions for fingerprint verification systems, that is, small training 
sets, have been simulated. 

Although further experiments are needed to draw definitive conclusions, the 
perceptron-based fusion approach appears to exhibit the best trade-off among high security 
requirements, the increase of the user population and the available sample size for 
weights optimisation. 



Abstract. We describe how an ‘in house’ classifier can enhance the performance 
of a commercial ‘black box’ classifier using the classic serial multiple 
classifier combination scheme. It is now acknowledged by the classifier combination 
community that parallel or hybrid decision fusion algorithms, in general, 
outperform serial combination schemes. However, classifier combination 
techniques that use class labeling, ranking or probability estimators need access 
to low level information supplied by all of the participating classifiers. Unfortunately, 
in many commercial applications the classifier is often a ‘black box’, 
which implies that it is not possible to manipulate the low level information regarding 
classification for these classifiers. In many such cases, a serial classifier 
combination model provides the only practical method to improve classification. 
In this paper, we present such an application in speech recognition. 



1 Introduction 

Commercial classifiers use a variety of ways to rank the input patterns to maximize 
recognition efficiency. Unfortunately, the optimization details are often not available, 
and in many cases they are tuned for an average user scenario. In other words, they 
are optimized to work reasonably well in a varied user environment. In many cases, 
optimizing these commercial recognizers for specific user requirements is extremely 
difficult, as most of the low level handles of the algorithms are not publicly available 
for manipulation. This scenario calls for using secondary and possibly tertiary 
classifiers to enhance overall recognition performance. In many large commercial 
systems, unfortunately, it is not economically viable to design classifiers from scratch 
due to time constraints, unavailability of skills and/or spiraling development costs. In 
most of these cases, the solution is to acquire one reasonably good 
Commercial-Off-the-Shelf (COTS) classifier. The second (and possibly third) classifiers 
are usually selected from what is freely available and, more often than not, are derived 
from academic research. Combining these classifiers with the COTS classifier then 
poses a major challenge. 

In this paper we discuss such a practical scenario. The focus of the paper is to offer 
insights into how the applied research of the commercial world differs in its 
approaches from the rigorous research of the academic world. This paper is an attempt 
to show how it is possible to apply some of the classifier combination techniques, 
albeit within restricted domains, to streamline an often ad-hoc approach. The chosen 
scenario is treated as typical, since many commercial systems face similar demands 
for performance enhancement without the benefit of detailed access to low level 
parameters. 

Specifically, we demonstrate how it is possible to use Support Vector Machines 
(SVMs) to optimize the performance of a commercial voice recognition engine (IBM 
ViaVoice™: http://www-3.ibm.com/software/voice/viavoice/) by using phonetic and 
other features (Fig. 1). 




Fig. 1. Re-ranking top choices from a commercial voice recognition engine using Support 
Vector Machine classifiers 

The rest of the paper is organized as follows. Section 2 introduces some terminol- 
ogy within the framework of a multiple classifier system (MCS) paradigm. Section 3 
presents a concise summary of previous research on commercial Automatic Speech 
Recognizers (ASRs). Section 4 introduces the proposed serial MCS. Section 5 pre- 
sents some results and finally, in Section 6, some conclusions are drawn and future 
work is discussed. 



2 MCSs: Some Terminology 

Within the framework of Multiple Classifier Systems (MCS), attempts have been 
made to categorize these systems based on their hierarchy. Three different decision 
combination topologies widely reported in the literature [1] are as follows: 

- Serial (Vertical) Combination Scheme: 

This corresponds to a physical structure in which the classifiers are applied 
sequentially. 

- Parallel (Horizontal) Combination Scheme: 

This corresponds to a physical structure in which the classifiers are applied 
concurrently and independently. 

- Hybrid Combination Scheme: 

In this case, serial and parallel combinations are used together. 

In this paper, we focus on serial combination schemes. 
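
The difference between the serial and parallel topologies can be illustrated with a 
minimal Python sketch. This is only an illustration under stated assumptions, not the 
system described in this paper: the type aliases `Classifier` and `Refiner`, the 
function names, and the use of majority voting as the parallel fusion rule are all 
hypothetical choices made for the example. 

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical types for illustration only.
Classifier = Callable[[str], str]      # maps a pattern to a label
Refiner = Callable[[str, str], str]    # maps (pattern, earlier label) to a label


def parallel_combine(pattern: str, classifiers: Sequence[Classifier]) -> str:
    """Parallel scheme: every classifier sees the pattern independently.

    Majority voting is assumed here as the fusion rule; at least one
    classifier must be supplied.
    """
    votes = Counter(clf(pattern) for clf in classifiers)
    return votes.most_common(1)[0][0]


def serial_combine(pattern: str, first: Classifier,
                   stages: Sequence[Refiner]) -> str:
    """Serial scheme: each later stage sees the output of the previous one."""
    label = first(pattern)
    for stage in stages:
        label = stage(pattern, label)
    return label
```

In the serial case only the final output of the first classifier is needed by the later 
stages, which is why this topology remains usable when the first classifier is a black 
box. 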



3 Previous Research 

Since this paper is focused on commercial Automatic Speech Recognizers (ASRs), 
we chose to review this area. IBM® Embedded ViaVoice™ is the leading Automatic 
Speech Recognition (ASR) software in the market 
(http://www-3.ibm.com/software/voice/viavoice/). The voice recognition engine is 
impressive, and the ability to use ViaVoice™ to develop applications for handheld 
PDAs makes it a valuable tool. However, there is room for improvement in the 
ViaVoice™ software and in the development model. SRI International offers an 
embedded ASR, DynaSpeak (http://www.dynaspeak.com/). Some of the newer ASRs 
have been designed specifically for small screen devices, with highly optimized and 
therefore very small footprints. These include Fonix (http://www.fonix.com/), Dragon 
(http://www.caere.com/naturallyspeaking/) and others. 

In this paper, we focus on the ranking of utterances that are candidates to be the top 
choices by the IBM® ViaVoice™ engine. Internally, ViaVoice™ uses a set of 
parameters to optimize this ranking, and some of these parameters are exposed for 
manipulation through its SDK. However, these parameters are too generic and cannot 
adapt to specific user environments. In the rest of the paper we offer an alternative 
solution: post-processing with SVM classifiers, used as black-box components, to 
re-rank the top choices offered by ViaVoice™. We demonstrate that in many cases this 
produces significant improvements over ViaVoice™ working on its own, and that it 
improves performance on the overall recognition task. 

There has been a significant amount of work on the serial combination of multiple 
classifiers. Space restrictions do not allow a full review of this area, and readers 
are referred to [2] for a comprehensive treatment. 

4 Serial Combination: IBM ViaVoice™ and Support Vector Machines (SVM) 

We propose a serial combination of the COTS IBM ViaVoice™ and a Support Vector 
Machine (SVM). 



4.1 IBM ViaVoice™ 

IBM® Embedded ViaVoice™ is the leading Automatic Speech Recognition (ASR) 
software in the market. Currently, all expected input must be pre-specified in a .bnf 
file in Speech Recognition Control Language (SRCL) format. There are several 
disadvantages to this development model. The process of designing a grammar 
requires the programmer to use production rules to generate all of the allowable 
phrases for the application. While this approach works well with simple 
predetermined commands, ViaVoice™ is not able to recognize speech that varies 
even slightly from the grammar. Furthermore, it is difficult to design a grammar in 
SRCL format that supports Natural Language (NL) input. Therefore, programmers 
using ViaVoice™ to develop an NL speech-enabled application must have an in-depth 
background in linguistics. 

The process of using ViaVoice™ to create a speech-enabled application is complex, 
requiring the programmer to complete many steps in order to compile a simple 
“Hello World” program. Between setting environment variables, keeping track of 
annotation markers, and defining the AppEvent in 4 or more places, the ViaVoice™ 
learning curve is quite steep. Also, ViaVoice™ is not easily compatible with the 
Microsoft® Visual Studio environment, nor does ViaVoice™ provide any development 
automation software. 

Fig. 2 shows a sample set of responses of IBM® ViaVoice™ to a typical input 
sentence. The input utterance is "Dial 5581230" and the top nine candidates are shown. 
Each response consists of three parts: the recognized text, an IBM-internal confidence 
score, and a Boolean acceptance flag (0/1), where '0' means rejection and '1' means 
acceptance. It is interesting to see that although the top choice was correct, the engine 
chose to reject it. In this example, no choice was accepted. 



Sentence: Dial 5581230 

Results: 

dial five five eight one two three, 4242242, 0 
dial five five eight one two three zero, 4241223, 0 
dial five five eight one two two zero, 42S933S, 0 
dial five five eight one two two zero, 42603S4, 0 
dial five nine eight one two three zero, 4903933, 0 
dial five nine eight one two three zero, 4904952, 0 
dial five one eight one two three zero, 4911739, 0 
dial five one eight one two three zero, 4912752, 0 
dial five four eight one two three zero, 4913212, 0 



Fig. 2. Typical input utterance and response from the voice recognition engine 
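
As an illustration of how a response such as the one in Fig. 2 could be read into a 
program, the following Python sketch parses lines of the form "text, score, flag". The 
line layout is assumed from the figure; the names `Candidate`, `parse_candidate` and 
`parse_response` are hypothetical and are not part of the ViaVoice™ SDK. 

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str        # recognized phrase
    score: int       # engine-internal confidence score
    accepted: bool   # True when the engine's acceptance flag is 1


def parse_candidate(line: str) -> Candidate:
    """Parse one 'text, score, flag' line (layout assumed from Fig. 2)."""
    # Split from the right so that commas inside the phrase would be preserved.
    text, score, flag = (part.strip() for part in line.rsplit(",", 2))
    return Candidate(text, int(score), flag == "1")


def parse_response(lines: List[str]) -> List[Candidate]:
    """Parse all non-empty lines, keeping the engine's original ranking order."""
    return [parse_candidate(ln) for ln in lines if ln.strip()]


example = parse_candidate("dial five nine eight one two three zero, 4903933, 0")
```

Keeping the engine's order in the returned list preserves the original ranking, which 
a later re-ranking stage needs as its starting point. 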



4.2 Support Vector Machine (SVM) 

SVMs have been extensively studied in recent years because of their ability to create 
very accurate decision boundaries in multi-dimensional feature spaces [3-6]. We chose 
an SVM as our secondary classifier because of its ability to find a robust global 
minimum in a multi-dimensional feature space. The following is a generalized 
description of the principal concepts of an SVM, based largely on [7]. 

A function $f: \mathbb{R}^N \to \{\pm 1\}$ is estimated from the training data, i.e. 
from $N$-dimensional patterns $x_i$ and class labels $y_i$, 
$(x_1, y_1), \ldots, (x_l, y_l) \in \mathbb{R}^N \times \{\pm 1\}$, such that $f$ will 
correctly classify new examples $(x, y)$; that is, $f(x) = y$ for examples $(x, y)$ that 
were generated from the same underlying probability distribution $P(x, y)$ as the 
training data. If no restriction is put on the class of functions that $f$ is chosen from, 
a function that does well on the training data may not generalize well to unseen 
examples. Assuming no prior knowledge about $f$, the values on the training patterns 
carry no information whatsoever about values on novel patterns. In that unrestricted 
setting learning is not possible, and minimizing the training error does not necessarily 
imply a small test error. Statistical learning theory, or VC (Vapnik-Chervonenkis) 
theory, shows that it is crucial to restrict the class of functions that the learning 
machine can implement to one with a capacity that is suitable for the amount of 
available training data. 

To design solid learning algorithms, a class of functions is required whose capacity 
can be computed. Support Vector classifiers are based on the class of hyperplanes 
$(w \cdot x) + b = 0$, $w \in \mathbb{R}^N$, $b \in \mathbb{R}$, corresponding to 
decision functions $f(x) = \mathrm{sign}((w \cdot x) + b)$. It is possible to show that 
the optimal hyperplane, defined as the one with the maximal margin of separation 
between the two classes, has the lowest capacity. It can be uniquely constructed by 
solving a constrained quadratic optimization problem whose solution $w$ has an 
expansion $w = \sum_i v_i x_i$ in terms of a subset of training patterns that lie on the 
margin. These training patterns, called support vectors (SV), carry all relevant 
information about the classification problem. One crucial property of the algorithm is 
that both the quadratic programming problem and the final decision function depend 
only on dot products between patterns. This makes it easier to generalize this solution 
to the nonlinear case. SV machines map the data into some other dot product space 
(called the feature space) $F$ via a nonlinear map $\Phi: \mathbb{R}^N \to F$, and 
perform the above linear algorithm in $F$, which only requires the evaluation of dot 
products $k(x, y) = (\Phi(x) \cdot \Phi(y))$. If $F$ is high dimensional, the right-hand 
side of this equation is very expensive to compute. In some cases, however, there is a 
simple kernel $k$ that can be evaluated efficiently. 

For instance, the polynomial kernel $k(x, y) = (x \cdot y)^d$ can be shown to 
correspond to a map $\Phi$ into the space spanned by all products of exactly $d$ 
dimensions of $\mathbb{R}^N$. For $d = 2$ and $x, y \in \mathbb{R}^2$, for example, 

$$(x \cdot y)^2 = \left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \cdot \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right)^2 = \left( \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \cdot \begin{pmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{pmatrix} \right) = (\Phi(x) \cdot \Phi(y)),$$ 

defining $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. More generally, it can be shown 
that for every kernel that gives rise to a positive matrix $(k(x_i, x_j))_{ij}$, a map 
$\Phi$ can be constructed such that this equation holds. It is also possible to use 
radial basis function (RBF) kernels such as 
$k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ and sigmoid kernels (with gain $\kappa$ 
and offset $\Theta$), $k(x, y) = \tanh(\kappa (x \cdot y) + \Theta)$. Support Vector 
Machines use a nonlinear decision function of the form 

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} v_i \, k(x, x_i) + b \right).$$ 

We have adopted SVM^light, which is an implementation of Vapnik’s Support Vector 
Machine [8]. The optimization algorithm used in SVM^light is described in [9]. The 
algorithm has scalable memory requirements and can handle problems with many 
thousands of support vectors efficiently. All SVMs discussed in this paper assume a 
dot-product kernel function. 
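
The kernel identity for d = 2 and the general form of the decision function can be 
checked numerically with a short Python sketch. This is an illustration only and not 
the SVM implementation used in this work; the function names, the example vectors and 
the support-vector coefficients in the usage below are assumptions made for the 
example. 

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    """Plain dot product (x . y)."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x: Vector, y: Vector, d: int = 2) -> float:
    """Polynomial kernel k(x, y) = (x . y)^d."""
    return dot(x, y) ** d


def phi2(x: Vector) -> Vector:
    """Explicit feature map for d = 2 and two-dimensional inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def rbf_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def decision(x: Vector, support_vectors: Sequence[Vector],
             coeffs: Sequence[float], b: float,
             kernel: Callable[[Vector, Vector], float]) -> int:
    """Evaluate f(x) = sign(sum_i v_i k(x, x_i) + b), returning +1 or -1."""
    s = sum(v * kernel(x, sv) for v, sv in zip(coeffs, support_vectors)) + b
    return 1 if s >= 0 else -1


x, y = (1.0, 2.0), (3.0, -1.0)
# Both sides of the identity (x . y)^2 = (Phi(x) . Phi(y)) agree numerically.
identity_holds = math.isclose(poly_kernel(x, y), dot(phi2(x), phi2(y)))
```

The kernel on the left needs only a two-dimensional dot product, whereas the explicit 
map on the right works in three dimensions; this gap grows quickly with d and N, which 
is the practical motivation for evaluating the kernel directly. 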



4.3 Manipulating SVM Features 

What we are looking for here is a post-processing black-box that either confirms or 
rejects the decision of the ViaVoice™ engine. In case of rejection, it should re-rank 
the choices according to its best judgment. Fig. 3 shows an example utterance and the 
response generated by the engine. 
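
The confirm-or-re-rank behaviour described above can be sketched in a few lines of 
Python. This is a hedged illustration rather than the implementation used in this 
work: `confirms_top` and `rerank_score` are hypothetical callables standing in for the 
trained SVM, and the `(text, score)` candidate layout is assumed. 

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float]   # (recognized text, normalized engine score)


def post_process(
    candidates: Sequence[Candidate],
    confirms_top: Callable[[Sequence[Candidate]], bool],
    rerank_score: Callable[[Candidate], float],
) -> List[Candidate]:
    """Keep the engine's ranking if it is confirmed; otherwise re-rank.

    Both callables are placeholders for the secondary classifier.
    """
    if not candidates or confirms_top(candidates):
        return list(candidates)
    # Higher rerank_score means a more plausible candidate (assumed convention).
    return sorted(candidates, key=rerank_score, reverse=True)
```

Because the post-processor only consumes the ranked list, it never needs access to the 
engine's internal parameters. 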

4.3.1 Signal Features 

The first column in Fig. 3 shows the candidates. The second column shows the Boo- 
lean decision by the engine. All the candidates are rejected in this case. The third 
column shows internal confidence scores of the engine. The fourth column shows the 
normalized scores. Finally, the fifth column shows the differences between two suc- 
cessive scores. Fig. 4 shows the graphical representation of the differences. These 
scores and their combinations are used as signal features for the SVM. 
view Jennifer Peters 

Results: 

Candidate                 Accepted   Score     Normalized    Difference 
view Jennifer Peters      0          2600051   0             0.699648 
view Jennifer DAVIDSON    0          2785921   0.699648425   0.117657 
view Julie Fruit          0          2817178   0.817305448   0.024497 
view JASON Curtis         0          2823686   0.841802742   0.051178 
call Jennifer Peters      0          2837282   0.892980554   0.03279 
view JULIE Jones          0          2845993   0.92577034    0.041632 
view Dennis Dent          0          2857053   0.967402188   0.012 
view Stacy Perkins        0          2860241   0.979402399   0.020598 
view Jamie Jaggers        0          2865713   1 

Fig. 3. SVM Features 
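
The normalized scores and successive differences in Fig. 3 are consistent with min-max 
normalization of the raw engine scores followed by first differences. The Python 
sketch below reproduces those two columns from the raw scores; the function names are 
hypothetical, and the normalization rule is inferred from the figure rather than stated 
explicitly in the text. 

```python
from typing import List, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Min-max normalize scores to [0, 1] (rule inferred from Fig. 3)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]   # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]


def successive_differences(values: Sequence[float]) -> List[float]:
    """Difference between each value and the one that follows it."""
    return [b - a for a, b in zip(values, values[1:])]


# Raw engine scores taken from Fig. 3, in ranked order.
scores = [2600051, 2785921, 2817178, 2823686, 2837282,
          2845993, 2857053, 2860241, 2865713]
norm = normalize(scores)
diffs = successive_differences(norm)
```

With nine candidates there are eight differences, which matches the empty last cell of 
the difference column in Fig. 3. 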




Fig. 4. SVM Features 
Fig. 5. IPA definition of Standard American English - Consonants 



4.3.2 Phonetic Features 

We use International Phonetic Alphabet (IPA) definitions 
(http://www.ic.arizona.edu/~lsp/IPA/SSAE.html) for generating phonetic features. 
Fig. 5 and Fig. 6 show the sounds of Standard American English in standard IPA 
notation. Our phonetic features are based on manipulating these notations and values. 
More information is given in Section 4.3.3. 



4.3.3 Final Combined Features 

Our features combined the signal and the phonetic features. The features we used 
included: 
Fig. 6. IPA definition of Standard American English - Vowels 



1. (Score_1 - Score_2) - (Score_2 - Score_3), where Score_1 is the highest score (the 
peak), Score_2 is the next highest score, etc., after normalization. 

2. (Score_1 - Score_2) / <phonetic difference between 1st and 2nd choice> 

In order to calculate the phonetic distance, we find the arrays of phonemes for both 
phrases and then find the mapping of one array onto the other that gives the minimal 
score. The overall score of a mapping is: 

Σ (distances between all pairs of phonemes mapped to each other) + 0.5 * (number of 
unmapped vowels) + (number of unmapped consonants) 

The distance between two phonemes is calculated as follows: 

- If one of them is a vowel and the other is a consonant, the distance is 1. 

- If both are of the same type (both vowels or both consonants), the distance is 

0.5 * sqrt (sum of squared differences of features) 

The vowel features include 'Front-Back' and 'Low-High' (from Fig. 6). The consonant 
features include the Place of Articulation (POA) and the Manner of Articulation 
(MOA) (from Fig. 5). Each feature is mapped to a number in the interval [0, 1]. For the 
'Front-Back' feature, 'Front' is mapped to 0 and 'Back' to 1. For the 'Low-High' 
feature, 'Low' is mapped to 0 and 'High' to 1. For 'POA', 'Voiceless bilabial' is 
mapped to 0 and 'Voiced glottal' to 1. Finally, for 'MOA', 'Stop' is mapped to 0 and 
'Glide' to 1. 
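
The scoring rules above can be sketched in Python as follows. The sketch makes one 
assumption that the text does not state: the mapping between the two phoneme arrays is 
taken to be order-preserving, so the minimal score can be found with an 
edit-distance-style dynamic program. The `Phoneme` representation, the function names 
and any feature values used with them are hypothetical. 

```python
import math
from typing import List, Sequence, Tuple

# ("V" for vowel or "C" for consonant, two feature values in [0, 1]).
Phoneme = Tuple[str, Tuple[float, float]]


def phoneme_distance(p: Phoneme, q: Phoneme) -> float:
    """Distance 1 across vowel/consonant; else 0.5 * Euclidean feature distance."""
    if p[0] != q[0]:
        return 1.0
    return 0.5 * math.sqrt(sum((a - b) ** 2 for a, b in zip(p[1], q[1])))


def unmapped_cost(p: Phoneme) -> float:
    """Penalty for leaving a phoneme unmapped: 0.5 for vowels, 1 for consonants."""
    return 0.5 if p[0] == "V" else 1.0


def phonetic_distance(a: Sequence[Phoneme], b: Sequence[Phoneme]) -> float:
    """Minimal mapping score, assuming an order-preserving mapping."""
    n, m = len(a), len(b)
    dp: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + unmapped_cost(a[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + unmapped_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + phoneme_distance(a[i - 1], b[j - 1]),  # map pair
                dp[i - 1][j] + unmapped_cost(a[i - 1]),                   # skip in a
                dp[i][j - 1] + unmapped_cost(b[j - 1]),                   # skip in b
            )
    return dp[n][m]
```

Two identical phrases give a distance of 0, so the second feature, which divides by the 
phonetic distance, would need a guard for that case. 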

The scoring and mapping process described here is, to our knowledge, novel and has 
not been used in other work that we have come across in our survey of related methods. 

5 Experimental Results 

We carried out tests with 400 commands (4 people x 2 modes x 50 samples each). 
There were three distinct objectives for the tests: 

a. Can we improve the overall recognition rate by using the secondary SVM classifier, 
using the performance of the IBM engine as the baseline? 

b. Can we distinguish cases where the two classifiers are better off on their own? 

c. Can we identify demographic groups on which the classifiers perform better? 
Table 1. Overall recognition rates 

                   Tester 1   Tester 2   Tester 3   Tester 4   Average 
ViaVoice™          56%        83%        88%        49%        69% 
SVM → ViaVoice™    79%        77%        70%        80%        76% 


Table 1 shows the overall recognition rates for the combination versus ViaVoice™ 
working on its own. On average, the combination achieves a higher recognition rate 
(76% versus 69%). The gain comes from Testers 1 and 4, for whom ViaVoice™ alone 
performs worst; for Testers 2 and 3 the combination scores lower than ViaVoice™ 
alone. On the whole, this pair-wise testing shows the following main themes: 

i. On average, the combination outperforms IBM ViaVoice™ working on its own. 

ii. The combination is consistently better than IBM ViaVoice™ in confirming the top 
choice of the rank table. 

When the combination says something is correct, it is more frequently correct. 



Table 2. Definitions of false positives and negatives 

        +            - 
+       ___          False -ve 
-       False +ve    ___ 


But this is not the complete picture. In practical commercial systems, there are two 
other aspects of performance that need to be addressed. The first is the analysis of 
false +ve and false -ve. Generally, users judge systems on the following two criteria: 

i. How many times a command is not accepted? 

ii. How many times a command is confused with another command? 

Table 2 defines how false positives and false negatives are counted. Table 3 sum- 
marizes the performance of the combination and the IBM ViaVoice™ engine focusing 
on false positives. Table 4 summarizes the performance of the combination and the 
IBM ViaVoice™ engine focusing on false negatives. 
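
As a hedged illustration of how such counts could be tallied, the Python sketch below 
counts both error types from a list of trials. It assumes, following the two user 
criteria above, that a false negative is a correct top choice that is not accepted and 
that a false positive is an incorrect top choice that is accepted; the names 
`ErrorCounts` and `count_errors` and the trial layout are hypothetical. 

```python
from typing import Iterable, NamedTuple, Tuple


class ErrorCounts(NamedTuple):
    false_positives: int   # accepted although the top choice was wrong
    false_negatives: int   # rejected although the top choice was correct


def count_errors(trials: Iterable[Tuple[bool, bool]]) -> ErrorCounts:
    """Count errors over (accepted, top_choice_is_correct) pairs."""
    fp = fn = 0
    for accepted, correct in trials:
        if accepted and not correct:
            fp += 1
        elif not accepted and correct:
            fn += 1
    return ErrorCounts(fp, fn)
```

For example, the response in Fig. 2, where the correct top choice was rejected, would 
be recorded as the trial `(False, True)` and counted as one false negative. 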



Table 3. False positive result analysis 

SVM/IBM     Tester 1   Tester 2   Tester 3   Tester 4 
Tester 1    ____       34/52      4/8        48/17 
Tester 2    16/50      ____       0/8        28/17 
Tester 3    8/50       8/52       ____       10/17 
Tester 4    10/50      12/52      0/8        ____ 





Table 4. False negative result analysis 

SVM/IBM     Tester 1   Tester 2   Tester 3   Tester 4 
Tester 1    ____       4/0        80/0       13/0 
Tester 2    16/2       ____       86/0       28/17 
Tester 3    34/2       24/0       ____       15/0 
Tester 4    32/2       16/0       4/0        ____ 

Based on these comparisons, the following observations can be made. 

i. The combination has a much lower rate of false +ves than IBM ViaVoice™ on its 
own. 

- IBM ViaVoice™ confuses too many commands. 

- IBM ViaVoice™ rejects too many commands even though they are correctly 
placed at the top of the ranking table. 

ii. However, IBM ViaVoice™ performed better than the combination with respect 
to false negatives. 

When ViaVoice™ says something is incorrect, it frequently is incorrect. 

The discussion so far shows that in a practical application, the straightforward 
recognition performance may not be the most appropriate yardstick. If reducing the 
false +ves is deemed more important from the user point of view, then the combined 
system is obviously the preferred option. However, if reducing false negatives is 
considered to be the priority, then the combination is not the obvious choice any more, 
although the combination does offer better overall recognition performance. 

In addition, it was found that the combination performs better than the COTS 
system when the speakers are non-native English speakers. This might be due to the 
fact that an SVM is a global expected risk minimizer, and not a local optimizer based 
on training samples [10]. An SVM's effectiveness depends on the number of Support 
Vectors (SVs) within a close proximity to the separating hyperplane. Local training 
may provide too many SVs. This might be an important consideration when deciding 
whether to use the combination for a given target user demographic in commercial 
marketing. 

The final point to address is the cost. Deployment of any commercial system must 
address the effect of choosing a computation-heavy solution in place of the default 
solution, which in this case is the COTS IBM ViaVoice™ engine. There are two types 
of costs, the training cost and the execution cost. Since the cost of training is directly 
connected with the length of the ranking list (in Fig. 2, this length is 9), we compared 
three list lengths: 1, 3 and 9. Our experiments show that the training cost goes up by 
8.9% if the length is changed from 1 to 3. If the length is 9, meaning that 9 options are 
used for retraining, the training cost increases by 35.2%. In terms of execution cost, 
our experiments show that the combined system is 14.2% slower when a 3-length 
ranked list is used. If the ranked list has 9 entries, the slowdown rises to 112.3%. In 
our experience, reducing the length of the ranked list from 9 to 3 does not produce any 
change in recognition performance. So for all practical purposes, the combined system 
will be 14.2% slower than IBM ViaVoice™ working on its own. This slowdown is 
not considered to be significant in terms of the user experience. 



6 Conclusions 

This paper has presented a novel method of calculating a secondary ranking from the 
confidence scores of a commercial voice recognition engine using SVM classifiers. 
The method is generic and, since it does not depend on the way the engine works, is 
applicable to a wide variety of commercial voice recognition engines. The 
experiments show that adopting an SVM as a secondary post-processing stage 
increases the average overall recognition rate of the chosen voice recognition engine. 